This is a blog on the Greek language. That is why it is called Hēllēnisteúkontos, “From the guy who has been a scholar of Greek”. But I arrogate the right to post here about other linguistics stuff that I find of interest. I have a below-the-fold arrangement, so you can bypass it easily.

This post is on misuses of numerical methods in linguistics, as applied to Esperanto.

I am no longer in any honest sense an Esperantist, or a Lojbanist, or a Klingonist. (Or, more self-consciously, mi ne plu estas esperantisto, .i mi ca ba’o lobypli, ‘ej tlhIngan Hol vIlo’ ‘e’ vimevpu’.) Not out of malice, indeed with a good deal of regret, but that’s where life has taken me.

But I mention my learnings from those languages on occasion, and in The Other Place, I just drew an analogy between the language politics of Esperanto cultural functions and Acadian cultural functions. Someone found the posting by googling “Esperanto”, and that made me follow some links, that led to some links…

… that led to mention of a recent couple of articles using computational methods to compare the linguistic profiles of an English and an Esperanto text, and come up with the conclusion that English and Esperanto were different. And then to make the extra conclusion that natural and artificial languages are different. Here are the articles: #1, #2.

I am grateful that the slices of the Esperanto blogosphere I sighted mocked this study: sample 1, sample 2. And I’m going to go to town on this here, because it deserves mockery.

Gillet and Ausloos, you are idiots. Maybe not in Computer Science, but on my turf, you have committed grand folly. You have taken two data points, English and Esperanto; you have compared the profile of their word lengths and word frequencies, and have decreed they’re different. Fine, they’re different. That says less than nothing about a comparison between artificial and natural languages! In God’s name, put up a study with Inuit, Turkish, and Chinese on the one hand, and Esperanto, Klingon, and Lojban on the other, and *then* you might have something relevant to say.

English and Esperanto word lengths and word frequencies are different. Oh come on.
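
To make concrete how little is being measured, here is a minimal sketch in Python of a word-length profile and a rank-frequency list. It is an illustration only, not the code or the method from the two papers: the function names are hypothetical, the tokenizer is a crude assumption (runs of letters, lowercased), and the two sample sentences are toy examples made up for this sketch rather than data from any corpus.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude assumption: a "word" is any run of Unicode letters, lowercased.
    # Real studies would need a principled tokenizer for each language.
    return re.findall(r"[^\W\d_]+", text.lower())


def word_length_profile(text):
    # Map each word length to the share of tokens having that length.
    words = tokenize(text)
    counts = Counter(len(w) for w in words)
    total = len(words)
    return {length: n / total for length, n in sorted(counts.items())}


def rank_frequency(text):
    # Token counts sorted from most to least frequent (the Zipf-style curve).
    counts = Counter(tokenize(text))
    return [n for _, n in counts.most_common()]


# Toy sentences, invented for illustration only.
english = "the cat sat on the mat and the dog sat by the door"
esperanto = "la kato sidis sur la mato kaj la hundo sidis apud la pordo"

for name, sample in [("English", english), ("Esperanto", esperanto)]:
    print(name, word_length_profile(sample), rank_frequency(sample))
```

Nothing in these numbers knows what a morpheme, a clause, or a calque is, so any two texts will come out "different" without that telling you why.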

See, this is the problem with computer scientists doing linguistics as if linguistics never existed. Just load some texts into a Multifractal Analysatron 2000, churn some gears, and that will tell us something interesting about language. Well no it won’t, not if you’re asking the wrong question, and have no framework to make sense of the answer. It’s not that we can’t learn anything new from the Multifractal Analysatron; but without building on what we already know, you’re guaranteeing that what you do build will fall over. It was computer science people that came up with “Garbage In Garbage Out” after all.

I was in the library yesterday, for the sake of melancholy nostalgia, and to see what I could get on French-Canadian linguistics. I walked by Diachronica, and leafed through it to see what was new in historical linguistics. April McMahon, who wrote a wonderful textbook on language change 15 years ago, has just co-authored a new book on… numerical methods in historical linguistics. My heart sank. It shouldn’t have, because April McMahon has earned my trust.

As the review said, one of the things McMahon points out in the book is, there is a regrettable tendency in numerical approaches to linguistics to just put the raw data into the Analysatrons, and see what happens. And she said, in a more measured and thoughtful way than I just did, that this is nonsense: a linguist still needs to make sense of the input, identify what correlations are worth pursuing, and filter out what methodologically needs filtering out.

I mean, word lengths and word frequencies? Even Plato had a more sophisticated understanding of language structure than that; and that’s not saying much.

There are some more details I’ll rattle off, with regard to word length in particular. They are triggered by the fact that in their preliminary studies, the authors were surprised to find Esperanto most similar to German and Spanish, and least similar to French and English.

If you’re surprised to find affinities between German and Esperanto, you know nothing of the history of Esperanto. And with just word length as your tool, and a comparable amount of inflectional morphology, I don’t know how meaningful the affinity they discovered is anyway.

But in particular, Esperanto is agglutinating, so it likes its words longer than an isolating language like Chinese or English (I think it’s fair by now to call English isolating). And Esperanto as a literary language was substantially influenced by German, because its most influential authors worked in the shadow of Prussia and the Austro-Hungarian empire, and German was a default model to them. (I’m thinking Ludovik Zamenhof and Kazimierz Bein in the first generation—Litvak Jew so culturally Russian, but with access to German; and Polish, respectively; and Julio Baghy and Kalman Kalocsay in the second—both Hungarians.)

The love of compounding is a way of dealing with the requirement to keep vocabulary minimal in an artificial language; but the choice of compounding rather than more analytical expressions is informed by German, not by interlinguistics. Not to mention the suite of compounds overtly calqued from German (verŝajna for wahrscheinlich “probable”, for instance).

The second paper made the mistake of profiling sentence length, and that was even more boneheaded. Sentence structure in a literate language is decidedly influenced by cultural contact: all of Europe has the mark of Latin subordination on it. And again, Esperanto sentence structure did not happen in a vacuum: Esperantists emulated the examples of their teachers and writers, and the teachers and writers patterned after natural language models. Which again were substantially German.

When we talk about the “spirit” of a language, we’re normally not primarily talking about morphology and syntax. We’re talking about semantic maps, and discourse structures, and idioms. It’s not that intangible, it’s just somewhat harder to formalise than morphology and syntax. Inasmuch as the spirit of Esperanto has kindred out there, however tenuous, that kindred is German. But profiling word lengths and word frequencies is not going to tell you much about morphology and syntax. And it will tell you little more about discourse structures.

At any rate, why *would* Esperanto be so different to natural languages? Some regularisation in its inflectional morphology, sure; but isolating languages are even more regular, by not having any inflectional morphology at all. Agglutinativity, sure; but Turkish and Lakhota were agglutinative before Esperanto was. Ludovik Zamenhof was not Mark Okrand, easter-egging his language with violations of linguistic universals.

The only quirk I can think of worth noting is Esperantists turning affixes into independent words. That quirk is artificial in origin: Zamenhof was supposed to say, in modern terms, that all morphemes of Esperanto are meaningful, and ended up saying that all morphemes of Esperanto are independent words. This has stuck: the diminutive suffix -et- is also the word for “tiny”, the object nominaliser -aĵ- is also the word for “thing”, the collective suffix -ar- is also the word for “grouping”. The trend has been taken far with successive generations of Esperantists, but was started by Zamenhof himself.

Yet even this is not alien to natural language. In fact, in its guise as degrammaticalisation, it was a favourite bone of contention between Lyle Campbell and Elizabeth Closs Traugott in the ’90s.

(Grammaticalisation theory claims that grammatical affixes come from particles and particles come from full words. So the suffix -like used to be the noun lich “body”. Degrammaticalisation is when the reverse happens; the canonical examples are from Estonian, but it also happens in English with up the ante: a particle—a preposition—turning into a verb. Is it an occasional exception under special circumstances? Or is it frequent enough to undermine the core premiss of grammaticalisation? Actually, that’s an ideological question, and it’s hard to resolve it one way or the other. Don’t know if anyone’s claimed victory.)

At any rate. Garbage In Garbage Out. Let that too be a lesson to… well, somebody.


  • JOSE says:

    You are right. Bravo!

  • opoudjis says:

    Ἡλλην-, because it's a perfect participle, and so takes a temporal augment as reduplication. People would in reality hesitate to augment a noun stem like this, but for ἑλληνίζω, which is actually a real verb, there are attested both augmented and unaugmented aorists. (The perfect is too recherche in this sense to be attested.) e.g. Thucydides 2.68.5 ἡλληνίσθησαν, Josephus AG 1.129 ἡλλήνισται.

  • JOSE says:

    Excuse me, ΗΛΛΗΝ-… or ΕΛΛΗΝ-?

  • opoudjis says:

    To explain maldotco: in Lojban, malglico, "damned English", was the phrase used to disparage literally translating English into Lojban, instead of grokking the structure of Lojban. maldotco is "damned German" in Lojban, same idea.

    I don't know how fair my pronouncement is; Zamenhof's style is oracular and emotive to Modern Esperanto ears, but that's because Modern Esperanto style was pretty much established by Kabe (Kazimierz Bein) instead, before he left the movement (and became a verb, kabei, "to ditch Esperanto"). As a result, Kabe to Modern Esperanto ears sounds… boring. The penalty of success.

    Let me note that, according to the visit counter, this post has attracted much more attention than is usual here. Even so, this blog will remain mostly about Greek. And greetings to the new readers, even if they may not go on finding anecdotes about the Greek language quite as interesting…

  • John Cowan says:

    I remember someone posting to the Conlang list a detailed comparison showing that modern Mandarin has more inflectional affixes than modern English — in both cases, of course, the number barely breaks out of single digits. So perhaps it is time to break the ancyent traditions and boldly say "isolating languages like English and Chinese".

    I also well remember you characterizing Zamenhof's Esperanto style as "charmingly maldotco".

