Your Fractal Analysis of Esperanto does not add up
This is a blog on the Greek language. That is why it is called Hēllēnisteúkontos, “From the guy who has been a scholar of Greek”. But I arrogate the right to post here about other linguistics stuff that I find of interest. I have a below-the-fold arrangement, so you can bypass it easily.
This post is on misuses of numerical methods in linguistics, as applied to Esperanto.
I am no longer in any honest sense an Esperantist, or a Lojbanist, or a Klingonist. (Or, more self-consciously, mi ne plu estas esperantisto, .i mi ca ba’o lobypli, ‘ej tlhIngan Hol vIlo’ ‘e’ vimevpu’.) Not out of malice, indeed with a good deal of regret, but that’s where life has taken me.
But I mention my learnings from those languages on occasion, and in The Other Place, I just drew an analogy between the language politics of Esperanto cultural functions and Acadian cultural functions. Someone found the posting by googling “Esperanto”, and that made me follow some links, that led to some links…
… that led to mention of a recent couple of articles using computational methods to compare the linguistic profiles of an English and an Esperanto text, and come up with the conclusion that English and Esperanto were different. And then to make the extra conclusion that natural and artificial languages are different. Here are the articles: #1, #2.
I am grateful that the slices of the Esperanto blogosphere I sighted mocked this study: sample 1, sample 2. And I’m going to go to town on this here, because it deserves mockery.
Gillet and Ausloos, you are idiots. Maybe not in Computer Science, but on my turf, you have committed grand folly. You have taken two data points, English and Esperanto; you have compared the profile of their word lengths and word frequencies, and have decreed they’re different. Fine, they’re different. That says less than nothing about a comparison between artificial and natural languages! In God’s name, put up a study with Inuit, Turkish, and Chinese on the one hand, and Esperanto, Klingon, and Lojban on the other, and *then* you might have something relevant to say.
English and Esperanto word lengths and word frequencies are different. Oh come on.
See, this is the problem with computer scientists doing linguistics as if linguistics never existed. Just load some texts into a Multifractal Analysatron 2000, churn some gears, and that will tell us something interesting about language. Well no it won’t, not if you’re asking the wrong question, and have no framework to make sense of the answer. It’s not that we can’t learn anything new from the Multifractal Analysatron; but without building on what we already know, you’re guaranteeing that what you do build will fall over. It was computer science people that came up with “Garbage In Garbage Out” after all.
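For the record, here is roughly what the raw input to such an Analysatron amounts to. This is a minimal sketch of my own, not the authors' code: the function names and the tokenisation (runs of word characters, lowercased) are assumptions made for illustration, and the multifractal analysis proper is not attempted.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: a "word" is any run of Unicode word characters, lowercased.
    return re.findall(r"\w+", text.lower())


def word_length_profile(text: str) -> dict[int, int]:
    # Map each word length to the number of tokens of that length.
    return dict(Counter(len(word) for word in tokenize(text)))


def rank_frequency(text: str) -> list[tuple[str, int]]:
    # Word types with their counts, most frequent first.
    return Counter(tokenize(text)).most_common()


if __name__ == "__main__":
    sample = "la hundo vidas la katon"  # made-up Esperanto sample
    print(word_length_profile(sample))
    print(rank_frequency(sample))
```

Nothing in either function knows what a morpheme, a compound, or a calque is, which is the whole complaint: the numbers come out the same whether or not anyone has asked a linguistic question.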
I was in the library yesterday, for the sake of melancholy nostalgia, and to see what I could get on French-Canadian linguistics. I walked by Diachronica, and leafed through it to see what was new in historical linguistics. April McMahon, who wrote a wonderful textbook on language change 15 years ago, has just co-authored a new book on… numerical methods in historical linguistics. My heart sank. It shouldn’t have, because April McMahon has earned my trust.
As the review said, one of the things McMahon points out in the book is, there is a regrettable tendency in numerical approaches to linguistics to just put the raw data into the Analysatrons, and see what happens. And she said, in a more measured and thoughtful way than I just did, that this is nonsense: a linguist still needs to make sense of the input, identify what correlations are worth pursuing, and filter out what methodologically needs filtering out.
I mean, word lengths and word frequencies? Even Plato had a more sophisticated understanding of language structure than that; and that’s not saying much.
There are some more details I’ll rattle off, with regard to word length in particular. Triggered by the fact that in their preliminary studies, the authors were surprised to find more similarity with German and Spanish, and least similarity with French and English.
If you’re surprised to find affinities between German and Esperanto, you know nothing of the history of Esperanto. And with just word length as your tool, and a comparable amount of inflectional morphology, I don’t know how meaningful the affinity they discovered is anyway.
But in particular, Esperanto is agglutinating, so it likes its words longer than an isolating language like Chinese or English (I think it’s fair by now to call English isolating). And Esperanto as a literary language was substantially influenced by German, because its most influential authors worked in the shadow of Prussia and the Austro-Hungarian empire, and German was a default model to them. (I’m thinking Ludovik Zamenhof and Kazimierz Bein in the first generation—Litvak Jew so culturally Russian, but with access to German; and Polish, respectively; and Julio Baghy and Kalman Kalocsay in the second—both Hungarians.)
The love of compounding is a way of dealing with the requirement to keep vocabulary minimal in an artificial language; but the choice of compounding rather than more analytical expressions is informed by German, not by interlinguistics. Not to mention the suite of compounds overtly calqued from German (verŝajna for wahrscheinlich “apparent”, for instance).
The second paper made the mistake of profiling sentence length, and that was even more boneheaded. Sentence structure in a literate language is decidedly influenced by cultural contact: all of Europe has the mark of Latin subordination on it. And again, Esperanto sentence structure did not happen in a vacuum: Esperantists emulated the examples of their teachers and writers, and the teachers and writers patterned after natural language models. Which again were substantially German.
When we talk about the “spirit” of a language, we’re normally not primarily talking about morphology and syntax. We’re talking about semantic maps, and discourse structures, and idioms. It’s not that intangible, it’s just somewhat harder to formalise than morphology and syntax. Inasmuch as the spirit of Esperanto has kindred out there, however tenuous, that kindred is German. But profiling word lengths and word frequencies is not going to tell you much about morphology and syntax. And it will tell you little more about discourse structures.
At any rate, why *would* Esperanto be so different to natural languages? Some regularisation in its inflectional morphology, sure; but isolating languages are even more regular, by not having any inflectional morphology at all. Agglutinativity, sure; but Turkish and Lakhota were agglutinative before Esperanto was. Ludovik Zamenhof was not Mark Okrand, easter-egging his language with violations of linguistic universals.
The only quirk I can think of worth noting is Esperantists turning affixes into independent words. That quirk is artificial in origin: Zamenhof was supposed to say, in modern terms, that all morphemes of Esperanto are meaningful, and ended up saying that all morphemes of Esperanto are independent words. This has stuck: the diminutive suffix -et- is also the word for “tiny”, the object nominaliser -aĵ- is also the word for “thing”, the collective suffix -ar- is also the word for “grouping”. The trend has been taken far with successive generations of Esperantists, but was started by Zamenhof himself.
Yet even this is not alien to natural language. In fact, in its guise as degrammaticalisation, it was a favourite bone of contention between Lyle Campbell and Elizabeth Closs Traugott in the ’90s.
(Grammaticalisation theory claims that grammatical affixes come from particles and particles come from full words. So the suffix -like used to be the noun lich “body”. Degrammaticalisation is when the reverse happens; the canonical examples are from Estonian, but it also happens in English with up the ante: a particle—a preposition—turning into a verb. Is it an occasional exception under special circumstances? Or is it frequent enough to undermine the core premiss of grammaticalisation? Actually, that’s an ideological question, and it’s hard to resolve it one way or the other. Don’t know if anyone’s claimed victory.)
At any rate. Garbage In Garbage Out. Let that too be a lesson to… well, somebody.
Old Man Hare: Etymology
I didn’t get to hit the books on Old Man Hare, but I’ve had enough feedback from readers and blegs that I can tell somewhat more of a story than last time. Let’s start with what we know.
Byzantine:
- We know of four mediaeval instances of the word.
- In Suda, 10th century, λαγώγηρως is used to gloss μύξος. I glibly said “all we know about μύξος is that it’s a λαγώγηρως”, which is good wisecracking, and poor insight. As LSJ has pointed out (h/t Nikos Sarantakos), μύξος is just a mangling of μυωξός “dormouse”.
- There are two instances in a collection of manuscript miscellanea, published by Delatte (Anecdota Atheniensia et al.) in 1927. I haven’t sighted the volume; one instance is λεβηρίς λαγόγερω “Old Man Hare pelt”, and one is λαγόγερος.
- There is one instance in a scholion on Lucian, used to gloss μυγαλῆ “field-mouse”. This scholion was cited in Stephanus’ 16th century dictionary, and in Bast’s 1805 Critical Letter, but is not included in the standard edition of Lucian scholia.
- I don’t know when the texts are dated from, but most scholia come from between 1000 and 1500. The text of the Suda could have been tampered with and embellished by scribes at any time up to the copy we have; but my default assumption is, this word was included as a gloss in the tenth century.
- The mediaeval instances all correspond, in orthography and spelling, to “hare” + “old-man”. The γήρως [ɣiros] spelling of “old man” is archaic; the modern form is γέρος [ɣeros]. -ως is an archaic second declension, which is likely scribal rather than vernacular.
Modern Greek
- There is a modern word, which I’ve seen as λαγόγερος, λαγόγυρος, and λαγογύρι, all of which denote the European ground squirrel aka European souslik.
- The minority form λαγόγερος also means “hare” + “old-man”. The majority form means “hare” + “round”. λαγογύρι is just a neuter variant of λαγόγυρος.
- λαγόγυρος and λαγώγηρως are both pronounced identically, [laɣoˈɣiros].
- The word is certainly attested, according to blog sightings, in Edessa (λαγόγερος) and near Corinth (λαγογύρι).
- γέρος in Modern Greek is both the adjective “old” and the noun “old man”.
- Greek does not allow noun+adjective compounds: λαγόγερος does not make any sense as “an old hare”. The only way it makes sense is as a noun-noun compound, “hare” + “old man”.
To interpret these:
- There is certainly no guarantee that λαγώγηρως meant the selfsame animal that λαγόγυρος does now (h/t Ηλεφούφουτος). It is certainly possible that the word meant a dormouse/fieldmouse back a millennium ago, and a squirrel now.
- OTOH, the scholiasts did not necessarily have a clear idea of what the ancient μυωξός and μυγαλῆ were. So they could have meant the same animal.
- λαγόγυρος “hare roundabout” is etymologically opaque.
- λαγώγηρως could well have been folk-etymologised as λαγόγυρος in modern times, given that Modern Greek no longer has [ɣiros] for “old man”. In fact since the two are pronounced identically, the spelling could be the fault of a modern scholar, who did not know about the mediaeval antecedents of the word (or the variant λαγόγερος).
- λαγόγερος is problematic as a compound: it cannot mean “an old hare”, the only way it can make sense is as an anthropomorphism, “a hare-like old man”.
- To this urbanite, sousliks look anthropomorphic. They stand on their hind legs:
That’s weak evidence for λαγώγηρως meaning “squirrel” from the start.
Bulgarian:
- There are various Slavic names for the European ground squirrel. Northern Slavic has variants of */sus/ (h/t Epea Pteroenta).
- Slovenian and Serbian have /tekunitsa/. The Bulgarian reflex of the */sus/ root meant “rat”, which suggests it used to mean “squirrel” and was displaced. But it’s just as possible that it changed meaning to “rat” without external prompting.
- The standard Bulgarian word for the European ground squirrel is /laluɡer/.
- The dialectal variants for the European ground squirrel (h/t Julia Krivorucko) are /laɡuder/ in Southern and Eastern Bulgaria; /laɡuntʃi/, /ləɡuntʃi/, /ladʒunjak/, /lədʒunjak/.
- These point to an original /laɡjuT/ or /laɡuT/, with T any coronal: /l, n, d/.
- The standard /laluɡer/ seems to be derived from */laɡuler/, and ηλε-Φούφουτος thinks it’s an assimilating metathesis, with /laluɡer/ easier to pronounce; the word for “chatterbox” is /laladʒija/.
- The Bulgarian etymological dictionary does not speculate on the etymology of /laluɡer/, which suggests it is etymologically opaque in Bulgarian.
- The Greek Balkanist Christos Tzitzilis (h/t ηλε-Φούφουτος) has suggested /laluɡer/ < λαγώγηρος without further discussion, as an illustration of Greek /o/ > Bulgarian /u/; the other example was /protuspor/ < /protospori/ “first seed”. Northern Greek already has unstressed /o/ > /u/, but this seems to be a general process in Bulgarian loans.
- The Bulgarian linguist Slavova has recently cited Tzitzilis’ derivation in passing, without disputing it.
Back to Greek:
- Modern λαγός “hare” has an allomorph λαγουδ- /laɣuð/ (from the Mediaeval diminutive λαγῴδιον), used in compounding: Modern diminutive λαγουδάκι, surname Λαγουδάκης, λαγουδοφωλιά “hare warren”. That would explain Bulgarian /laɡud/.
- The accepted etymology of λαγωνικό “hunting dog” is from λακωνικό “Laconian dog”, with contamination from λαγός—so “hare dog”. That would explain Bulgarian /laɡun/.
- Modern Greek has the word λαγουδέρα /laɣuðera/, “rudder”; its etymology is unknown, but it looks like “hare”.
What does this tell us?
- The sideways step of */sus/ in Bulgarian from squirrel to rat confirms the flexibility of animal names across time, and that a λαγώγηρως was not necessarily a squirrel. Still…
- The Greek form appears three or four centuries after the arrival of the Slavs in the Balkans. So it could be originally Greek or originally Slavic.
- The form is attested in Greek Macedonia (which makes a Slavic loan quite possible) and the Peloponnese (which makes it not impossible). If I had attestation from the islands, a Slavic loan would look more doubtful.
- If the form is indeed opaque in Bulgarian, and especially if the form is absent in Proto-Slavic, then a Greek origin is likelier.
- The form attested in the Middle Ages in Greek (and reasonably early at that) is only /laɣoɣVr/, not a form based on /laɣuð/ or /laɣun/. If the form was derived as some sort of Greek adjective from “hare”, it has only left traces in Bulgarian: there are no traces of a λαγουδ- based form in Byzantine or Modern Greek. That’s not impossible, but it’s not my default assumption.
- George Baloglou wondered whether there might have been calquing of a Slavonic form into a Greek “Old Man Hare” within Bulgarian, before the form moved south, which would explain the grammatical awkwardness. If we accept the anthropomorphic “a hare-like old man”, the formation is slightly less awkward; and while such cross-linguistic calques do happen (e.g. German Handy for “mobile phone”), they are rare and learnèd. Again, not my default assumption.
- The palatalised variants of /laɡuntʃi/ and /ladʒunjak/ in Bulgarian are not obviously motivated within Greek; I don’t see a clear reason why Greek would offer *λαγιουδάκι or *λαγωινικός as models. For now, I’m happy to make that Bulgarian’s problem.
- The conservative take is to assume the Greek word was always λαγώγηρως; that it was borrowed into Bulgarian as */laɡuɡer/; and that the awkward l-g-g was dealt with in dialect by assimilation to l-l-g (/laluɡer/), as ηλε-Φούφουτος suggested, or by dissimilation to l-g-d (/laɡuder/). The latter dissimilation may have been modelled after the Greek diminutive λαγούδιν, without that implying that the beastie was ever called a “little hare” or variant thereof.
- I don’t know what’s happened with /laɡuntʃi/ and /ladʒunjak/, and it’s a bit much to go straight from /laɡuɡer/ to /ladʒunjak/. But unlike /laɡuder/, I don’t see a straightforward way for Greek to explain that level of variation.
More tentative than I’d like, but it looks originally Greek, and there’s circumstantial evidence (the standing on two legs) to suggest it was a squirrel back then too. That’s my take at any rate.
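For anyone who wants the two repairs of the awkward l-g-g sequence spelled out mechanically, here is a toy sketch. It assumes the reconstructed borrowing */laɡuɡer/ from the conservative take above; the function names are mine, and plain string substitution is only a stand-in for real phonological rules.

```python
G = "\u0261"  # IPA voiced velar stop, as used in the transcriptions above

# Reconstructed (unattested) borrowed form */laɡuɡer/
BORROWED = "la" + G + "u" + G + "er"


def assimilate(form: str) -> str:
    # l-g-g > l-l-g: the first velar stop copies the initial lateral.
    return form.replace(G, "l", 1)


def dissimilate(form: str) -> str:
    # l-g-g > l-g-d: the second velar stop becomes a coronal stop.
    i = form.rfind(G)
    if i == -1:
        return form
    return form[:i] + "d" + form[i + 1:]


if __name__ == "__main__":
    print(assimilate(BORROWED))   # standard Bulgarian form
    print(dissimilate(BORROWED))  # Southern and Eastern dialect form
```

The palatalised variants do not fall out of either rule, which matches the point above that they remain Bulgarian’s problem.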
War of Troy
For far, far too long, scholars have treated Early Modern Greek literature as linguistic quarry, and have neglected these texts as literature in their own terms. Over the past couple of decades, this injustice has finally started to be redressed, as the Romances in particular have gained much deserved attention.
This post, on the other hand, continues in the bad old tradition.
As the TLG expands its coverage into Early Modern Greek, I have gone through the word recognition of the War of Troy. The War of Troy is an interesting text, and I am going to say some very superficial things about it.
The War is a retelling of the events of the Iliad. It’s not a first hand retelling. Around the end of the Roman empire, two people calling themselves Dictys the Cretan and Dares the Phrygian—and eyewitnesses to what went down in Ilium—wrote popular Latin retellings of the story. The Middle Ages being a gullible time, Dictys’ and Dares’ narrative ended up taken more seriously than Homer’s. After all, Dictys was there!
A few centuries later, Benoît de Sainte-Maure based his Roman de Troie on Dictys’ and Dares’ Latin. Some time after that, an anonymous Greek produced a Greek translation. (Outside Crete, anyone writing in the vernacular was anonymous: it wasn’t the kind of writing you took credit for.)
So we have a three generations down retelling of the Homeric original. Greeked by a writer who knows the names of Achilles and Helen in Greek, but not much else: there’s no evidence the translator knew any Homer. (And why would he need to? After all, Dictys was there!)
So you can imagine what’s happened to the names in this. That was a fun couple of nights to work through. The translator would take a look at the Old French, breathe in, and guess. From what I can see, Benoît had done the same. Telemachus (Τηλέμαχος “Fights-Far-Away”) becomes Θελέμαχος “Wanna-Fight”. Assyria becomes Ζύρη. Boeotia Βοιωτία becomes Βοέκη (via Boëce). The King of the Scythians Rex Scytharum became Citare and thence Κιτάριος. A king from Syme, ex Syme, became Essimieïs and thence king of Ἐξιμιόνη. Somehow, Zeleia Ζέλεια ended up as Σιτζήλια, and Sicily Σικελία as Ζήλικος. We even have a Ἱουπιτής and a Νέπτιπος in the cast. That’s Jupiter and Neptune to you. And the only way I can explain Iphinoös Ἰφίνοος becoming Ἰσίδιος is via f looking like ſ (long s). (Did they have long s in 1400?)
All very fish in a barrel, that, so we’ll move on. The War of Troy was first published very very late. In fact, 1996. Each romance has its own vocabulary: the romances liked coining compounds, the War in particular has a lot of partly digested Old French, and there are some words that look to be one-offs anyway. But there are three reasons why we’re going to have serious gaps in documentation of the War’s vocabulary for a long time.
- Most Early Modern texts had some sort of edition, however crappy, available by the time Kriaras started writing his dictionary. The War didn’t. So the volumes of Kriaras written before 1996 won’t know anything about its vocabulary.
- The volumes of Kriaras up to 1997 each had addenda about new words that had turned up since the last volume. A horrid chore if you’re looking up a word, but editing Early Modern Greek is still a boom industry. (Or it was 10 years ago, when I was able to follow it.) So that kind of update does need to happen. After its decade hiatus, the new volumes don’t have addenda. So unless there’s a change of policy (or they become a real electronic dictionary, with dynamic update), Kriaras is not going to cover the War at all.
- That’s *real* electronic dictionary. The online abridgement was supposed to keep updating with each volume of the post-hiatus full dictionary. Nothing’s happened in the past three years.
- The editors of the War said they would publish a Volume II with a glossary. (Mercifully, they did put in a few pages of the more common undocumented words in Vol. I.) It’s been 13 years, and Google tells me naught; I have no particular reason to hold my breath that I will see a Vol. II.
It’s a shame, because the glossary they do include has some words that I’m scratching my head about.
- I’m convinced I’ve seen καιρογεύω “to hire” somewhere before, but I have no idea where. Neither does Kriaras or Trapp.
- A γορζέρα is a visor on a helmet, translating French ventaille. (Yes, of course the War of Troy has visors and jousting and mediaeval stuff.) But ventaille is not [ɡorzera], and there’s no way that’s a Greek word. Where did it come from?
- καλανίζω means “to spatter”. Where did *that* come from?
- Ditto τόρτσα [tortsa] for chandelier, and χωρίγιν [xoriɣin] for “cement”.
I suspect some of this is Italian or Venetian, but some of it clearly isn’t.
There’s a reason the War took so long to publish. It’s huge by the standards of the time: 14000 verses. It was quite popular (six surviving manuscripts), so there’s a lot of manuscript collation to do. We have the French original, which you also have to take into account when doing the collation. What that leaves you is a lot more donkey work than usual for editing a text: normally you’d be lucky to have a couple of manuscript witnesses.
It also means the temptation there is to emend the text, to match the French original more closely. The degree of emendation in the War is more than most scholars these days are comfortable with; and because the emendation did not prioritise linguistic plausibility, you have to look in the margin (the source readings) if you’re going to do any linguistics with the text. There are anachronisms in there.
Emendation in mediaeval texts is inevitable, just as it is inevitable in Classical texts: texts got miscopied, mistransmitted, misconstrued. But vernacular texts don’t work like Classical texts in transmission: the scribes feel a lot freer to tinker with the text, because ten centuries of Ancient Authority aren’t going to gainsay them. And the results of a scribe tinkering with a vernacular text are not as noticeable as with a Classical text—so you have less of a gut instinct to go with, for which of two variants is the original.
Gut instinct is a risky thing to rely on anyway, and Early Modern Greek texts have suffered a lot from Modern Greek scholars coming along, and assuming they know the language and metre and poetics of the texts better than the scribes did. The scribes were often enough blockheads, that’s true. That doesn’t mean modern editors aren’t fallible.
The world owes Stylianos Alexiou, for instance, gratitude for making the Escorial Digenes legible, and reviving Cretan Renaissance theatre. The world does not owe Stylianos Alexiou gratitude for:
- Assuming 14th century iambic heptameter was the same as 20th century iambic heptameter, and emending it when it wasn’t
- Assuming the 16th century dialect of Rethymnon was the same as the 20th century dialect of Rethymnon, and emending it when it wasn’t
It’s not that they are not usually the same; but they are decidedly not always the same, and you’d better have an explicit argument for when you do tamper with the text.
And I’m sorry, but in the same vein, if you’re editing a 14th century text like the War of Troy, and you’re looking to emend a verse with a single-syllable future particle, because the French original is in the future tense—you do NOT use a particle that first appeared in Greek in 1700! What on earth use is that? Would you put “gonna” into your Chaucer? “Kthxbye” into your Dickens?
“tl;dr”, that you can put into Dickens…
The particle in question is θα, which in 1400 was only starting to emerge as the uncontracted θέλω να. The monosyllabic particle they were actually looking for was να, the subjunctive marker: the future was “I may go” in 1200 before it started becoming “I will go” in 1400, and by 1700 “I’ll go” (if I can use an English analogy here).
“θα exists in this text only by emendation (1816, 8796, 8972)” p. lxxvii—it shouldn’t be there at all. Verse 1816:
Ἐτοῦτοι ὁποῦ εἰς τὰ κάτεργα θὰ εῖναι διωρθωμένοι
“Those who in the galleys shall be set right”
1816 θὰ Pap[athomopoulos], cf. 4669 E cil qui remandront as nes : ἂς ABVX.
The ἂς reading is grammatically awkward, and somewhat odd (“those who in the galleys may they be set right”); but it’s in all the manuscripts, and is not outright wrong. If you want to claim that the translator must have rendered the future remandront with a future, then for pity’s sake use the historically plausible future: Ἐτοῦτοι ὁποῦ εἰς τὰ κάτεργα νὰ εῖναι διωρθωμένοι
Let that be a lesson to… well, to someone.
On nominalisations ending in -εία
A post on Greek spelling. You’ve been warned.
The spelling of the noun ending -εία vs. -ία had come up a few months ago on the Magnificent Nikos Sarantakos’ blog, as an orthographic bedevilment. Modern Greek writers feel ἀμηχανία (awkwardness) about how to spell the ending, and they’ll be reassured to know the Byzantines felt the same ἀμηχανεία.
The story goes like this:
- Ancient Greek has an ending -ία, used to form abstract nouns from verbs and adjectives. It corresponds to -ness, and when it is borrowed into English, it shows up as -y. So:
ἄμνηστος /ámnɛːst-os/ “unremembering, forgetful”, ἀμνηστία /amnɛːst-ía/ “forgetfulness, amnesty” (because I’m forgetting your crime).
ἁρμόζω /harmó-zdɔː/ “I fit”, ἁρμονία /harmon-ía/ “joint, suture, harmony” (because the parts fit together).
- When -ία is attached to a verb ending in -ευ- /ew/ in Proto-Greek, the result is spelled -ε-ία /ewía/ > /e.ía/ > /éːa/. That is to say, it’s the same /-ía/ suffix, attached to /ew/. But because /w/ did not stick around in Greek, /ewia/ ended up pronounced /éːa/, and was distinct from the normal /ía/ ending.
- So: δυνάστης /dynást-ɛːs/ “master”, δυναστεύω /dynasté-wɔː/ “be lord over”, δυναστεία */dynast-ewía/ > /dynást-éːa/ “lordship, dynasty”.
εἴρων /éːrɔːn/ “dissembler”, εἰρωνεύομαι /eːrɔːné-womai/ “play dumb, use understatement, make fun of”, εἰρωνεία */eːrɔːn-ewía/ > /eːrɔːn-éːa/ “dissembling, understatement, mockery, irony”.
- So there was a rule on how to form these nouns. If a verb ending in -ευ- was involved, it ended in -εία. Otherwise, it ended in -ία. So ὀνοματοποιέω /onomatopoi-éɔː/ “make up a name”, ὀνοματοποιία /onomatopoi-ía/ “onomatopoeia”. It isn’t ὀνοματοποιεύω, so it’s not ὀνοματοποιεία. Which is just as well: onomatopoeia has enough vowels in it already.
- Problem #1 with spelling these nouns was, by the time of Christ, ει and ι were pronounced identically. So the subtle etymological differentiation was begging to be undone.
- Problem #2 was, -ία was not only applied to verbs and adjectives. It could also be applied by analogy from other abstract nouns, even though there was no underlying verb to derive it from. So μαντεύω “to tell the future” gives μαντεία “telling the future”. People started making up nouns of different ways of telling the future: ἡλιομαντia “by the sun”, ἡμερομαντia “by the date”, κυνομαντia “by dogs”, λυχνομαντia “by lamps”.
- How do you spell these? Do you spell them like the simple noun μαντεία? Or do you say that there is no such verb as ἡλιομαντεύω or λυχνομαντεύω, so you should use the simple -ία spelling? LSJ chooses the analogy: -μαντεία.
- Similarly, λάτρις “hired servant” > λατρεύω “adore” > λατρεία “adoration”. When people made up a word for “adoration of idols, idolatry”, /eːdɔːlolatría/, they did not go through a verb εἰδωλολατρεύω. So how were they supposed to spell it? Like λατρεία “adoration”? Or should they instead derive it from εἰδωλολάτρης “adorer of idols”, which would make it -λατρία? This time, conventional spelling did not go with the analogy, and decided to spell it εἰδωλολατρία, deriving it straight from εἰδωλολάτρης. Which is more plausible etymologically.
- But that brings you to the unfortunate situation that, whenever you spell a word in -ía, you need to know the derivational history of Greek—whether a corresponding verb in -ευ- has ever turned up or not. This is of course ludicrous, and people were thoroughly confused. LSJ chose analogy for λυχνομαντεία, but notes that the papyrus the word turns up in spells it λυχνομαντία. LSJ has εἰδωλολατρία, but a verb εἰδωλολατρεύω does show up, in Eusebius, Athanasius, and John Chrysostom; and the TLG has 342 instances of the εἰδωλολατρεία spelling versus 1038 of the εἰδωλολατρία spelling.
- To complete the confusion, Problem #3. Byzantines couldn’t see why verbs ending in -εύω produced nouns ending in -εία, but verbs ending in -έω didn’t also keep the epsilon. I mean, they both had epsilons in them; and Byzantines didn’t know or care about the prehistory of */w/ in Greek. So they started spelling with -εία words the ancients had never spelled with -εία. Remember “amnesty”? There isn’t just an adjective ἄμνηστος “forgetful”; there’s also a derived verb ἀμνηστέω “to be forgetful”. And if ἀμνηστέω exists, that’s reason enough to start spelling “amnesty” with an epsilon, as ἀμνηστεία.
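The original rule from the list above is easy enough to state as a lookup, which is exactly why it breaks down. The sketch below is illustrative only: the set of stems with an attested verb in -εύω is a hypothetical mini-lexicon drawn from the examples in this post, and treating bare stems as keys is my simplification.

```python
import unicodedata


def nfc(text: str) -> str:
    # Normalise so that visually identical polytonic strings compare equal.
    return unicodedata.normalize("NFC", text)


# Hypothetical mini-lexicon: stems for which a verb in -εύω is attested
# (δυναστεύω, εἰρωνεύομαι, μαντεύω, λατρεύω from the examples above).
VERBS_IN_EUO = {nfc(s) for s in ("δυναστ", "εἰρων", "μαντ", "λατρ")}


def abstract_noun(stem: str) -> str:
    # Classical rule: -εία if a verb in -ευ- is involved, otherwise -ία.
    stem = nfc(stem)
    suffix = "εία" if stem in VERBS_IN_EUO else "ία"
    return nfc(stem + suffix)


if __name__ == "__main__":
    for stem in ("δυναστ", "ἀμνηστ", "ὀνοματοποι", "εἰδωλολατρ", "ἡλιομαντ"):
        print(stem, "->", abstract_noun(stem))
```

With this lexicon the compounds come out with plain -ία, which agrees with the conventional εἰδωλολατρία but not with LSJ’s -μαντεία. Add εἰδωλολατρεύω to the set on Eusebius’ authority and the answer flips. The spelling depends entirely on what you let into the lexicon.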
The thread over at Sarantakos’ included the host’s melancholy observation that this conundrum itself was reason enough to long for a phonetic spelling reform. It probably won’t come to that, but it *is* enough for Greeks to turn to their linguists, and plead with them “give us a rule we can follow!”
(The thread also triggered Sapere Aude’s immortal snark “άβυσσος το spelling αυτού του weird lingo”.)
The guesswork of “whether a corresponding verb in -ευ- has ever turned up or not” is not such a rule. (Has there ever been a verb ἀγγελολατρεύω “to venerate angels”? No peeking!) Universally spelling such nominalisations with just -ία is also a non-starter: λατρία does look like wholesale spelling reform. The sensible compromise is, the simpler alternative when in doubt. (Which applies to a lot of contemporary Greek spelling.) That means compound nouns like εἰδωλολατρία always get spelled with an iota, no matter what verbs Eusebius came up with.
Not that I’m going to bother initiating correspondence with whoever’s running Greek spelling these days. (It sure ain’t the Academy of Athens. Education Ministry, I guess.)
Old Man Hare
[EDIT: followup post]
As I already mentioned in the past, the occasional Early Modern Greek word ends up in LSJ, because it has been used in a scholion to explain an Ancient word, and LSJ figured they’ll take all the help they can get.
Such a word is λαγόγηρως. Literally, it’s “Old Man Hare”. Actually, literally, it’s “Hare Old Man”, but that just wouldn’t work in English. As recorded in LSJ, it’s used in the scholia to Lucian Dream 24 to gloss μυγαλῆ “field-mouse”. You won’t find it in the 1906 Rabe edition of the Scholia to Lucian, which the TLG has: it’s “ap Bast. Ep. Crit. p. 169”. This is an instance of Classicists’ infuriating habit of using abbreviations without explaining them anywhere (and I checked). After some googling, I worked out it means that the gloss is mentioned in Friedrich Bast’s 1805 Lettre critique de F.-J. Bast à M. J.-F. Boissonade, sur Antoninus Liberalis, Parthenius et Aristénète. (So Rabe’s edition does not contain every single piece of Byzantine commentary ever authored on Lucian? Grrr.)
The word λαγόγηρως is also used in Suda, the mixmatched 10th century encyclopaedia, to gloss μύξος—although that doesn’t help us much, because the only thing we know about a μύξος is that Suda says it’s a λαγόγηρως.
So it’s likely a Modern Greek word, and given the scholion to Lucian, it’s likely a field-mouse, or some other rodent of that ilk. Old Man Hare shows up in other dictionaries too, but it does not show up in the big contemporary dictionaries of Greek; so other lexicographers are on their own. Trapp shrugs and says it’s just “an animal”. (He does at least record the more modern-looking variant λαγόγερος.) Kriaras can afford to go further, particularly given where it has been compiled (I’ll explain in a sec): it reports that a λαγόγερος is “a kind of rat, a μυγαλῆ”, and the passage it cites in response is an Early Modern falconry manual, in which the Old Man Hares are seized by birds of prey.
So those aren’t Human Old Men being seized, and while they could be hares, there’s no reason to think the Lucian scholion is wrong: it’s some sort of rodent.
It’s also a μύξος, whatever on earth that is, and as I was perusing recent additions to Suda On Line, I noted the newly translated entry on μύξος, expressing some puzzlement about glossing an unknown word with another unknown word.
At this stage, I had not checked LSJ, and I had not checked Trapp (which would have told me nothing anyway), and I certainly had not checked Kriaras. Instead I noticed that the two unknown words were not the same flavour of unknown. Suda didn’t just say a μύξος is an Old Man Hare: it said a μύξος is an Old Man Hare παρ’ ἡμῖν. That παρ’ ἡμῖν means “with us”; and in Suda’s way of structuring definitions, it means “in our language”. As in, our vernacular, not Ancient Greek.
So without looking at Kriaras, I realised this word was at least Early Modern Greek, and quite likely Modern Modern Greek. I popped across to the online Triantafyllidis dictionary, and didn’t find Old Man Hare there: so it’s not a word that’s made it to the Contemporary Standard. But figuring that there are always surprises to be had on the Interwebs, I googled λαγόγερος just in case.
I found Old Man Hare in a Greek digital photography forum. Like me, the photographer was an urbanite who wouldn’t know a field-mouse from a dormouse (which was the Suda translator’s first surmise). I mean… I don’t know: *are* they the same thing? But the chap took the pic, recorded where and when the photo was taken—midday, near Edessa, in Greek Macedonia; and added what the locals call the beastie. Ladies and Gentlemen, courtesy of poster “Junior”, meet Old Man Hare:
He exists, and he certainly looks like what Lucian’s scholiast had in mind—and what the falconry writer established was an appropriate afternoon snack for an eagle.
And of course, some Greek dialectologist somewhere has recorded the fact that in Edessa (and probably elsewhere) this beastie is called an Old Man Hare. Because the Modern Greek dialect dictionary is still stuck at delta, it’s not straightforward to find out who—although at least a draft of the remaining letters is now prepared. But Kriaras’ dictionary staff have a fair collection of dialect glossaries on site, so they would have had the wherewithal to figure it out. And even if they didn’t, Edessa is just a 94 km drive away from downtown Salonica. You’ll probably run into Old Man Hare before you get to the waterfalls. (That’s why the Other Language’s name for Edessa is Vodena, “waters”.)
Three other hits of note for Old Man Hare online. One was the Suda On Line entry, cached when it wasn’t yet translated. One was from another Greek forum, this time ecological, recording Old Man Hare as one of the animals you might be surprised to find in the vicinity of Thessalonica. The Byzantine pronunciation /laɣoɣiros/ seems to survive for our rodent friend, because one posting later, courtesy of poster Kostas Karpadakis, Old Man Hare shows up again, this time spelled as λαγόγυρος, “Hare Roundabout”:
I still can’t tell you whether it’s dormouse or field-mouse. Or hamsteroid. I showed the three links to Nikos Sarantakos (he whose Magnificent Blog I keep extolling), and he said that even before he got to the forum mention, he’d worked out this must have been a χαμστεροειδές. That kind of nonce macaronic coinage—American stem, archaic suffix—is pretty damned funny if you’re steeped in the angst of Greek language history.
*I* can’t tell you, but Karpadakis did not take the photo himself: he linked to a zoology site at Novi Sad Uni, and the photo is labelled s-citellus02.jpg. A citellus is none of the above: as the site itself says, this is a Spermophilus citellus, which in English is the European ground squirrel, aka European Souslik.
And when I google his “historically wrong” spelling λαγόγυρος, I get not three or four hits, but 321. Confirming it as the Spermophilus citellus or Citellus citellus. Some more links: citellus #1, citellus #2, citellus #3. And a table of beastie names at Nature Names for Tourists: Local names for distinctive European mountain wildlife:
English | Polish | Slovak | Slovene | Romanian | Bulgarian | Greek | scientific |
---|---|---|---|---|---|---|---|
European ground squirrel or souslik | suseł moręgowany | syseľ pasienkový | tekunica | popândăul | лалугер | λαγόγυρος; σπερμόφιλος | Spermophilus citellus; also Citellus citellus |
So the Bulgarians got their word for the beastie from the Greeks.
In fact, that /laluɡer/ in the Bulgarian may suggest the modern spelling λαγόγυρος, which in Byzantine Greek would have been /laˈɣoɣyros/, is more accurate, and the Old Man bit was a written correction… Nah, the /u/ is in the wrong spot. Some Bulgarian phonological process, I guess: the Greek Macedonian pronunciation would be /laˈɣoɣirus/, which doesn’t explain лалугер. So I’ll stick with the assumption for now that this was originally Old Man Hare reanalysed as Hare Roundabout, until I hear something to the contrary. And that the accepted modern spelling is λαγόγυρος, although that form isn’t in Triantafyllidis’ dictionary either.
The third link for λαγόγερος, Sarantakos dismissed as “an incredible concoction”, and I’m not disagreeing. The article was by an Italian classicist, and it was dealing with the Suda entry on μύξος as a whole, which goes into bizarre beliefs about donkey urine. The identity of Old Man Hare comes up just before the conclusion.
I’ve never studied Italian, but what with Esperanto, Latin, French, and a fair exposure to Classical Music in my youth, I can sort of read it. It’s helped working in a French & Italian department, and not being as embarrassed about speaking in Super Mario Bros. Italian as I am about speaking in Pepe Le Pew French. I could even make the minimal effort of dealing with the mojibake of the page, such as by, I dunno, switching my browser encoding to Latin-1.
But once I’d worked out the article’s claim, I had no motivation to proceed further. A dictionary I had not checked was Sophocles’, and the author accepted and elaborated on Sophocles’ surmise that an Old Man Hare was a kind of fish. Because λαγώς was also a word for sea-hare.
Wee tim’rous beastie, you’re not a sea-slug of the Aplysiomorpha clade perchance, are you?
No, I didn’t think so.
That is ungracious and horrid of me. I had the benefit of late Aughties Interwebs, this guy… well, this guy was writing in 2006, so he did as well. Hm. I had access to Modern Greek (even if it was in digital photography forums, detouring via Novi Sad Uni’s Zoology department), this guy likely didn’t. Still, the guy was writing after the compilation of LSJ, and even if Suda is obscure about Old Man Hare, the scholiast to Lucian is not. Sophocles at least offered a definition for λαγόγηρως, but Sophocles *is* dated, and you can do better than that in general. The falconry manual may well have been inaccessible; but Suda has a whopping big “in our vernacular” in there, he could at least have asked some Greek contacts.
The blunder here is violating the Common Sense precept of etymology: If you’re looking for Latin etymologies, you start your search on the Tiber. I tracked down the origins of that saying in my personal blog just before—taking a break from obsessing about constructions of Franco-Canadian identity; and it sounds a hell of a lot better in the original German. I’m always a little miffed when Classicists get as far as Byzantine Greek, and don’t notice the live language spoken on the other side. (Witness LSJ’s definition of στοίχημα: it’s “wager”, not “deposit”, and they got “deposit” by a casual reading of Eustathius’ scholion.)
But now at least, we have photographic refutation. In Edessa, you search for etymologies on the river Voda.
A sentence I smirk at, I confess, given where you search for the etymology of the river Voda. Now unsurprisingly renamed Edesseos.
Thanasis Costakis RIP
Thanasis Costakis, doyen of Tsakonian Linguistics, has died, and the next Tsakonian Studies conference at Lenidi will be held in his memory.
I have not heard of his passing anywhere else, and cannot find an obituary online, so I assume it has been this past year.
I was lucky enough to talk to him in Athens in ’95. A kindly old man, and resigned to the death of his language.
Θανάσης Κωστάκης, 1906-2009. Αιωνία του η μνήμη.
Lerna VIIc: Variants
The various counts of lemmata that I’ve been putting out for the last while have made little mention of the difficulty in deciding whether two forms belong to variants of the same lemma, or distinct lemmata. The judgement call is difficult enough within a homogeneous language, with slight variations in derivational morphology. It’s even worse with a large linguistic span like we’ve been dealing with, with lots of dialectal variation, phonological change across time, and spelling mutability.
So confronted with two similar nominative singulars in the vocabulary, or two 1st person present indicatives, you need to decide whether you’ll count them as the same lexeme or not. And how liberal you are in your counting will decide how many lemmata you count, and how many you dismiss as variants.
To illustrate with English: you won’t count color and colour as distinct lemmata, nor publicise and publicize. You won’t count recieve as a distinct lemma from receive: misspellings happen, and you still need to count those misspellings as something. Though imbed is not a misspelling of embed, but a different derivation, you’d still want to conflate them too as the one headword.
OTOH, some people draw a distinction between racist and racialist. You may not, but if enough people do, you have to treat them as distinct. (The fact that the OED has decided they’re now the same thing does not mean the entire language community has.) That’s a judgement call in itself: people are uncomfortable with morphological variation as much as with any other kind, and people swore to the OED that there’s a meaning distinction between gray and grey too. So there’s no right answers. But there are arbitrary decisions.
Dictionaries normally make those arbitrary decisions for you, because they cross-reference variants to main headwords, and you can rely on their judgement. But dictionaries’ judgement can be as arbitrary as any other’s: the distinctions can be fine—particularly with variant suffixes, as we’ll see. And some dictionaries conflate variants more effectively than others. Kriaras has a lot of phonetic variability, and hides a lot more spelling variability, because of its normalised modern spelling. But it does a reasonable job of indicating which forms are just variants.
By contrast, LSJ’s more discursive entries include derived lemmata and similar lemmata, as well as simple variants, in the one entry. So it’s harder for automated processing of dictionaries, such as the TLG lemmatiser does, to trust that two words cited in the one entry are in fact the same lexeme. Because the automated processing errs on the side of caution, it will distinguish variants as lemmata more than it should, which inflates the lemma count. The TLG lemmatiser currently sees two or more lemmata in some 12,000 LSJ entries; eyeballing, I’d say maybe a fifth of those could arguably be conflated. There are also a number of lemmata in later dictionaries (Trapp and Kriaras most notably) which could also arguably be conflated with lemmata in LSJ—but haven’t been yet, because I haven’t been through all 70,000-odd of their lemmata manually. (I’m not bold enough to guess how many.)
So there is a margin of overcounting lemmata; OTOH, there are many more instances where one could argue lemmata are undercounted, because the lemmatiser conflates variants. *I* won’t argue that, because I’ve been responsible for a lot of the conflating. But this has been driven by a particular take on the vocabulary of Greek: if you’re searching for words through a search engine, you’re likelier to care about meaning than inflection, and you’ll want the search to retrieve any word instance that looks enough like your word to match. You won’t want to skip matches just because the spelling is slightly off. So for that purpose, fewer lemmata, meaning more search hits per lemma, is a good thing.
This has meant that, where there was doubt about whether a form is a distinct lemma or not, e.g. in LSJ cross-references, or variation between LSJ and Trapp, I’ve usually conflated those forms that I’ve been through manually. That’s not everyone’s purpose. If you’re trying to inflate the count of distinct words of Greek, it’s definitely not your purpose. Even if that isn’t what you’re trying to do, there will be disagreements on how much to conflate.
Part of those disagreements are tied to the dead tree: paper dictionaries have been understandably more reluctant to conflate variants that are alphabetically distant from each other. But there have been times that the choice to conflate hasn’t been clear; and times when I’ve decided not to conflate. (The TLG search engine does display cross-referencing to the user, so that decision is not fatal.)
So what sorts of things may or may not be conflated as the same lemma in Greek? This laundry list will cover at least some of it:
- Dialectal phonological differences
- For some of the ancient dialects some of the time, these differences are so predictable, they’re not even mentioned in LSJ entries. The big example is the different treatment of Proto-Greek */aː/ in Doric and Aeolic (stays α), in Ionic (goes to η), and in Attic (η except after ε, ι, ρ). The second example is Aeolic being accented as far back in the word as humanly possible, something which in the rest of Greek regularly happens for just verbs and compounds. There’s more, and together they all mean that you shouldn’t count Attic βοηθέω, Ionic βωθέω, and Doric βοαθοέω as different verbs for “help”. Nor for that matter Aeolic βαθόημι: the different inflection is normal for Aeolic, and α is a plausible fate for /oaː/. (Just like the Ionic ω is.)
- Classical spelling variations
- If the odd inscription spells βοηθέω as βοιηθέω, that still isn’t reason enough to call it a different lemma. Some spelling variation is endemic to the Classical language, because the pronunciation of those words was in flux—as occurs at any stage of any language. Usually, dictionaries will sweep this spelling variation under the carpet too, in generic cross-references like “κρεω-: see κρεο-“. Once LSJ has worked out that λιπ- was the original pronunciation of compounds meaning “lacking in”, every compound that can be spelled with λειπ- is listed under λιπ- .
… With the inconvenient exception of λειπογνώμων “without an inspector (or tariff, or distinguishing mark)”, for which no ancient authority gives a λιπ- spelling, because the pronunciation was already changing. LSJ does dismiss the mediaeval spellings of the word as λιπογνώμων; but really, who could blame the mediaevals?
- Late spelling variations
- λιπογνώμων is very, *very* far from the only instance when mediaeval or even Hellenistic writers could not cope with the increasingly historical spelling of Greek, and spelled words more creatively than LSJ allows. LSJ is an historical dictionary, so its business is to go back to the original phonology of the word where possible, and dismiss everything outside its ambit. But if the manuscripts and papyri spell the word differently, that does not mean it’s a different word. I’ve spent several entertaining months helping the lemmatiser cope with λλ as an alternate spelling of λ, or ι as an alternate spelling of ει, or σζ as an… interesting spelling of σ.
- Hypotheticals
- In making sense of the derivation of words, grammarians would often suggest what they thought the real underlying form of a word was. For example, in discussing ἀζηχής “continuous”, they would suggest the form is underlyingly ἀδιηχής “unseparated” or ἀδιεχθής “unhostile”. I have on occasion conflated these hypothetical derivations with the word they’re explaining, when there was not anything else to be done with them.
- Hypercorrections
- With written Greek increasingly remote from the spoken language, Byzantine writers did increasingly oddball things to make their words sound high-falutin’. Theodore Metochites’ Homer-Through-The-Looking-Glass is one egregious instance, the result of too much Classical learning (“if Homer said νοῦσος for νόσος, then I’ll say σουφία for σοφία”). Theodore Studites’ is another, and the result of not enough Classical learning: I’ll never quite get over him working out that the imperfect of περισσεύω is περι-έσσευε. Throughout the period, there’s a persistent tendency to stress words on the “wrong” syllable. I assume the authorities don’t discuss it because it was beneath their notice; and I assume the Byzantines did it Because They Could.
This means that Byzantines have made up a bunch of variants for lemmata which did not exist, quite artificially. However artificial they are, they turn up in the corpus, so they need to be accounted for; but they shouldn’t be counted as new words. They’re old words in clown outfits.
- Language change
- But Classical words don’t only turn up in the corpus in either chlamys or clown outfits. They also turn up in denim. Hm, I don’t know if that analogy is going to work…
Variation in words is also going to result from natural language change. (In fact, differentiating natural and artificial language change is harder than it looks.) As we saw in earlier episodes, the *anr- stem of Proto-Greek, “man”, turns up as ἀνήρ, ἀνδρός in Classical Greek, as ἄνδρας, ἄνδρα in Byzantine Greek, and as ἄντρας, ἄντρα in Modern Greek. The third declension of the original has been done away with, and the Ancient /ndr/ cluster is now spelled differently, leaving the eye-pronunciation /nðr/ to learned forms.
For the purposes that the TLG lemmatiser is put to—searching words in a diachronic corpus—that’s not enough reason to call them different lemmata. ἄντρας is the natural development of ἀνήρ, it means the same thing, so they count as the same thing. But morphologically ἄντρας has already moved on from ἀνήρ, and counting them as the same is a diachronic artifice. It’s only because we’re trying to cover three thousand years in the one corpus, that we’re conflating words three thousand years apart.
- Variations in compounding
- Putting stem A and stem B together forms you a new compound AB. That compound should for the most part count as the same word, regardless of slight differences in how the compound is put together. So ὑδατοφόρος “water-bearing” does not show up in the dictionaries, but ὑδροφόρος “water-bearing” does; they should count as variants of the same lemma, since hydro and hydato are allomorphs of the same noun, /húdɔːr/ gen. /húdatos/ “water”. This, already, is a conflation lexicographers will not be equally eager to embrace.
Similarly, Classical Greek has δαφνηφόρος “carrying laurels” (as a ceremonial act), using -η- as a combining vowel; Late Greek no longer used -η- as a combining vowel, so the word turns up there as the more regular δαφνοφόρος. I’m disinclined to call these distinct lemmata. Others may not be.
- Variations in inflection
- Here things really do start getting murky: if the inflection of two forms is slightly different, though their stem is the same, should they count as the same lemma? I have allowed them to sometimes, especially if there is a distinction between an earlier and a later, more transparent or commonplace form. So when Theodore Studites, with his shaky command of classical morphology, uses εὐέλπης, εὐέλπες instead of εὔελπις, εὔελπι, or when he forms the aorist passive participle of εὐκρινέω as εὐκριθέν, implying a citation form εὐκρίνω, I smile benignly, and allow that he’s gotten the Classical form wrong—not that he’s come up with a brand new lemma. When the variants are contemporary, I’m more reluctant to make that conflation. So I have let ἀθλεύω and ἀθλέω remain distinct verbs. That’s not to say I’ve never done such a conflation—especially when LSJ has said “= sq.”; but I’ve had less of a motivation to.
So I have come up with a count of lemmata that conflates some variants and keeps others distinct. In conflating variants, I am coming up with a lower count of lemmata than others might. Which means there is a question mark over the count of 173,000 lemmata I’ve claimed.
Well, good. Like I keep saying, there’s a question mark over any count of words of any sort. But given that the lemmata contain multitudes, it’s worth uncovering how many of those multitudes there are. So I’m going to count up the variants.
The first warning about these counts is that this is still not how you’re going to beat English in counts of words. The OED counts some 250,000 lemmata (and its coverage of Old and Middle English is *not* exhaustive); with variants, that gets up to 615,000. There will be a lot more than 173,000 variants to count in the Greek corpus; but there won’t be 615,000.
The second is to explain what I’m counting.
- I allow any variation in the lexical root, recorded in the lexical database, to count as a distinct variant. So ἀνήρ, ἄνδρας, and ἄντρας count as three variants, and Doric ἀγέννατος and Attic ἀγέννητος as two.
- I do *not* count dialectal or diachronic change in the same inflection paradigm (including tense stems) as different. So I don’t count both Ionic χώρ-η and Attic χώρ-α, or Doric δωρ-ίσδω and Attic δωρ-ίζω, as different.
- I don’t count dialectal variants in preverbs, because they are not part of the root; so ξυμμαχέω is not counted separately from συμμαχέω.
- I don’t count uppercase and lowercase variants of the same lemma. (But I do count them when they are distinct lemmata.)
- I don’t count active and passive variants of the same root verb as distinct.
- And I don’t count the ad hoc respellings that the lemmatiser can do on the fly, to recognised deviations particularly in diplomatic editions.
So. I had 214,381 lemmata in the corpus. Without proper names and Milesian numbers, that came down to 172,646. How many variants does that translate to?
- Variants: 362,947
- Without numbers: 352,895
- Without names and numbers: 286,652.
(I’d guessed around 350,000 variants a couple of postings ago. That’s pretty good. It would be even better, if my guess hadn’t excluded proper names…)
This amounts to 1.7 variants per lemma. I’ll admit to some surprise that the OED ratio of variants to lemmata is more like 2.5: Greek historical spelling should allow for comparable confusion. My suspicion is it does, and I’m discounting ad hoc misspellings which the OED doesn’t.
Names are slightly more variable than normal words: 66,243 variants for 37,306 names, which is a ratio of 1.8:1. Foreign names in particular get mangled in several ways, including creative hellenisations: that’s why there are 75 different variants of “Muhammad”, and 43 variants of “Lombard”. (Those examples aren’t fair, since “Muhammad” includes the Turkish “Mehmed”, and “Lombard” also includes the earlier “Longibard”—so again, I’m conflating variants more than some might.)
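For anyone who wants to check the ratios, here is a minimal sketch in Python. The figures are the ones quoted above; the dictionary labels and the function name are just illustrative, not anything the lemmatiser itself uses.

```python
# Figures quoted in this post; the labels are illustrative only.
counts = {
    "common words": {"lemmata": 172_646, "variants": 286_652},
    "proper names": {"lemmata": 37_306, "variants": 66_243},
    "OED": {"lemmata": 250_000, "variants": 615_000},
}


def variants_per_lemma(lemmata: int, variants: int) -> float:
    """Average number of recorded variants for each lemma."""
    return variants / lemmata


# Round to one decimal place, as the post does.
ratios = {label: round(variants_per_lemma(**c), 1) for label, c in counts.items()}

for label, ratio in ratios.items():
    print(f"{label}: {ratio} variants per lemma")
```

The name variants are also just the difference between the two counts above: 352,895 without numbers, less 286,652 without names and numbers, gives 66,243.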
That count can be whittled down further of course.
- If we discount all variants noted as hypothetical (which were made up by grammarians, and were not used in the actual language), we come down to 274,650.
- If we ignore variation in accentuation, which is mostly a Byzantine hypercorrection, we’re down to 265,233.
- If we ignore the uncertainty between ει and ι, which bedevilled the Koine, we’re down to 260,362.
- Ignore double consonants: 253,891.
- Ignore the distinction between η and α, as a brute-force levelling of Doric and Attic, and we’re down to 247,294.
- Ignore the distinction between smooth and rough breathing (which occasionally tripped scribes up): 245,672.
Even with these common causes of variation excluded, that’s still some 73,000 added variants (42%) that I have not counted as distinct lemmata. That means that one could argue some of them should be counted as separate—although I have trouble seeing how a consistent criterion could be devised, especially over such a large timespan.
So the lemma count may be an underestimate, because of different judgements on what counts as distinct; but at its most inflated, the lemma count will no more than double. In reality, I think the debatable instances are closer to 20% than to 70%. So no, not even this way are we getting to 5,000,000 lemmata.
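The whittling-down above amounts to applying cruder and cruder spelling normalisations, and counting how many distinct forms survive each one. Here is a minimal Python sketch of that idea. It is not the lemmatiser's actual code: the step definitions, the function names, and the toy word list are all my assumptions, and several of the toy misspellings are invented purely for illustration.

```python
import re
import unicodedata

ACCENTS = "\u0300\u0301\u0342"   # combining grave, acute, Greek circumflex
BREATHINGS = "\u0313\u0314"      # combining smooth and rough breathing


def drop(marks):
    """Build a step that deletes the given combining marks from a word."""
    return lambda w: "".join(ch for ch in w if ch not in marks)


# Each step is applied on top of the previous ones, in the order of the list above.
STEPS = [
    ("accents", drop(ACCENTS)),
    ("ει/ι", lambda w: w.replace("ει", "ι")),
    ("double consonants",
     lambda w: re.sub(r"([βγδζθκλμνξπρστφχψ])\1", r"\1", w)),
    ("η/α", lambda w: w.replace("η", "α")),       # brute-force Doric/Attic levelling
    ("breathings", drop(BREATHINGS)),
]


def levelling_counts(forms):
    """Return (step name, distinct forms left) after each cumulative levelling."""
    # Work on decomposed text, so accents and breathings are separate characters.
    keys = {unicodedata.normalize("NFD", f) for f in forms}
    counts = [("raw", len(keys))]
    for name, step in STEPS:
        keys = {step(k) for k in keys}
        counts.append((name, len(keys)))
    return counts


toy_forms = [
    "ἀγέννητος", "ἀγέννατος",      # Attic and Doric, from this post
    "λειπογνώμων", "λιπογνώμων",    # ει/ι, from this post
    "σοφία", "σοφιά",              # second form: invented accent shift
    "ἄλλος", "ἄλος",               # second form: invented single-λ spelling
    "ἁμαρτία", "ἀμαρτία",          # second form: invented breathing slip
]

for name, n in levelling_counts(toy_forms):
    print(f"{name}: {n} distinct forms")
```

On the toy list, each step merges exactly one pair. The real database is obviously messier: the doubled-consonant rule here would also flatten γγ, for instance, which is a digraph rather than a doubling.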
Lerna VIIb: Lemma counts and proportion of text recognised
We can keep dredging lemmata up to move towards a target of 300,000. But of course for a living language, as Modern Greek now is and as Ancient Greek once was, there is no ceiling in lemmata: people can always make up new words, and do. And because dictionaries will never exhaust what words people come up with, even if they work off a limited corpus, the constructive thing to do is not to say how many lemmata are in a language.
The constructive thing rather is to say, if I know n lemmata in a language, how many word instances in a corpus will I understand? If I know n lemmata, how much of the text I’m confronted with can I make sense of? If a vocabulary of 500 words lets you understand just 70% of all the text you’ll see, you’re in some trouble. If a vocabulary of 50,000 lets you understand 99.7% of a text, on the other hand, that’s one word instance out of three hundred that you’ll draw a blank on. Assuming 500 words a page, that’s around three unknown words every couple of pages. That’s still a lot: if you’re having to run to the dictionary once a page, you’ve got catching up to do in the Word-A-Day club. One word every ten pages—say 99.98% of all word instances: that’s probably more reasonable.
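The arithmetic in that paragraph is easy to redo for any coverage figure. A minimal sketch in Python, assuming the same 500 words a page; the function names are mine.

```python
def one_in(coverage_pct: float) -> float:
    """Average number of word instances you read per unknown instance."""
    return 100.0 / (100.0 - coverage_pct)


def pages_per_unknown(coverage_pct: float, words_per_page: int = 500) -> float:
    """Average number of pages between unknown words (500 words a page assumed)."""
    return one_in(coverage_pct) / words_per_page


for pct in (99.7, 99.98):
    print(f"{pct}%: 1 unknown in {one_in(pct):.0f} instances, "
          f"one every {pages_per_unknown(pct):.2f} pages")
```

Strictly, 99.7% works out closer to one in 333 than one in 300, or one unknown word every two-thirds of a page; 99.98% gives the one word in ten pages.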
That gives you a statistic of how many word instances are recognised; but when you’re listing the words you don’t know yet, you tend to list unfamiliar word forms, not word instances. So if you come away from your reading of The Superior Person’s Little Book of Words with a list of words you need to look up in the dictionary, you won’t count contrafibularity three times and floccipaucinihilation seven times. You only have to look up contrafibularity once to understand it, so you’ll list it as a single unknown word form.
The proportion of recognised word forms is going to be much lower than the proportion of recognised word instances. The word instances will give you credit for knowing words like and and the and of: hey presto, with those three words, you already understand 20% of all printed words of English! The words you won’t know will tend to be one-offs, occurring just once or twice in a text: it’s rare words, which people don’t come across a lot in texts, that they won’t have needed to learn. But with word forms, and and the and of don’t count as 20% of all printed words of English: they only count as three word forms. The unfamiliar one-offs will make a much larger dent in the size of your vocabulary, than in the proportion of a page you can grok.
Again, the point of this is to say, not that there are n words in a language, which is deeply problematic in ways I’ve gone into great length on. It’s to say that, if you know n lemmata, you will understand n1% of the vocabulary of a corpus (its word forms), and n2% of all the text in a corpus (its word instances). The value of n can go up or down, and the proportions of words you understand can go up or down with it. This means two things which are more useful to keep in mind than any grand How Many Words statement.
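To make the difference between the two proportions concrete, here is a minimal Python sketch on a toy English sentence. It is a simplification: it treats each spelling as its own word form and skips lemmatisation altogether, and the sentence, the vocabulary, and the function name are invented for illustration.

```python
from collections import Counter


def coverage(tokens, known):
    """Return (share of word forms known, share of word instances known)."""
    freq = Counter(tokens)                      # word form -> number of instances
    total_forms = len(freq)
    total_instances = sum(freq.values())
    known_forms = sum(1 for w in freq if w in known)
    known_instances = sum(n for w, n in freq.items() if w in known)
    return known_forms / total_forms, known_instances / total_instances


text = "the cat and the dog and the contrafibularity of the cat".split()
vocabulary = {"the", "and", "of", "cat", "dog"}

form_cov, instance_cov = coverage(text, vocabulary)
print(f"word forms recognised: {form_cov:.1%}")
print(f"word instances recognised: {instance_cov:.1%}")
```

One unknown word out of eleven instances costs you about 9% of the text, but one out of six word forms costs you about 17% of the vocabulary. The commoner the known words are, the wider that gap gets.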
First, it’s not about how many words there are ever, so much as how many words are *useful* to know. If there are fifty words which were made up for Joe Blow’s autobiography, and Joe Blow’s autobiography has never been published or indeed sighted outside his kitchen, then those fifty words will not form part of your corpus, so they need not count. Or, if there are three hundred words of phrenology which no one has used for the past century, and they only got used once in a blue moon, then even if those three hundred words do show up in your corpus, they will be marginal enough to cut out most of the time. Tying vocabulary size to recognition allows you to limit the lexicon to what you will actually use, and how frequently you will use it.
The second realisation is the admission this makes, that the size of a vocabulary is asymptotic. People can keep making up words, or using words in increasingly niche and esoteric contexts. If n words let you understand 99% of the vocabulary, then you may well be able to come up with 5n words to recognise 99.9% of the vocabulary, and 20n to recognise 99.99% of the vocabulary, and even 100n to recognise 99.999% of the vocabulary. But by the time you’re up to 99.99% recognised words, you can reasonably ask whether it’s worth spending an extra five years building up your vocabulary, just to deal with the remaining 0.01%.
The answer is no. Dictionaries do not wait forever before they decide they’re done: they have a large corpus, and dip into it fairly eclectically, but they do miss stuff (not just “sausage” in that Blackadder episode on Johnson’s Dictionary). And that’s OK, if the word is obscure enough for the dictionary’s purposes. Dictionaries employ some subjectivity in leaving out words until they think they’re worth taking seriously; but the way a corpus is put together usually filters the obscurities out for you already. If you’re relying on printed text to prove a word is worth describing, you’re leaving out all the made up and nonce words and speech errors that were never written down. Of course, that’s an elitist way of viewing language, and print is nowhere near the barrier it used to be. But it does cut your corpus down to something manageable.
If you’re working on a Classical language, the cruelty of Time (and Bastard Crusader scum), the indifference of scribes, and the snootiness of schoolmasters do plenty of filtering for you as well. That’s why the PHI #7 corpus, which was not subject to the same filters as the literary corpus, has so much distinct vocabulary.
(For an example of why the bones of the Bastard Fourth-Crusader scum should boil in pitch in eternity, see the fate of the text of Ctesias and the other manuscripts unfortunate enough to be in Constantinople in 1204.)
So what sort of recognitions do the figures I’ve been quoting represent? I’m using the five corpora as before, and I’m also differentiating between all word forms, and just lowercase word forms—because proper name recognition lags behind recognition of common words in general. Lowercase word forms is a somewhat crude metric for leaving out proper names, and there are a few TLG editions which follow the e.e.cummings stylings of their manuscripts, leaving names lowercase. But it’s all about the indicative figures, always.
Corpus | % Instances Recognised | Recognised Instances Ratio | % Lowercase Instances Recognised | Recognised Lowercase Instances Ratio |
---|---|---|---|---|
TLG + PHI #7 | 99.66% | 1:294 | 99.86% | 1:740 |
TLG | 99.84% | 1:624 | 99.915% | 1:1170 |
LSJ | 99.37% | 1:158 | 99.83% | 1:585 |
Mostly Pagan | 99.964% | 1:2759 | 99.979% | 1:4750 |
Strictly Classical | 99.967% | 1:2993 | 99.975% | 1:4019 |
Corpus | % Forms Recognised | Recognised Forms Ratio | % Lowercase Forms Recognised | Recognised Lowercase Forms Ratio |
---|---|---|---|---|
TLG + PHI #7 | 89.56% | 1:9.6 | 94.33% | 1:17.6 |
TLG | 93.91% | 1:16.4 | 95.59% | 1:22.7 |
LSJ | 89.51% | 1:9.5 | 95.77% | 1:23.6 |
Mostly Pagan | 99.16% | 1:118 | 99.42% | 1:172 |
Strictly Classical | 99.56% | 1:226 | 99.62% | 1:263 |
Let’s go through this slowly.
The lemmatiser understands the Strictly Classical corpus—literary Greek up to iv BC—quite well. It only fails to pick up 1 in every 226 distinct word forms, which means you have to go through on average 2993 word instances—say six pages of text—before you hit a word it does not understand. But you can ignore capitalised words, because they’re typically proper names, and we don’t expect to have those in our vocabulary anyway. You can make sense of “Alcidamophron slaughtered the servant of Tlesipator” more readily than you can “Jack fnocilurphed the smorchnepot of Jill”. If we do ignore capitalised words, the lemmatiser fails to understand just 1 in 263 word forms, and gets through over eight pages of text on average before it finds a problem word. As machine understanding of morphology goes, that’s not bad at all.
So the 55,000 lemmata that the lemmatiser knows of for the Strictly Classical corpus get you through eight pages of Greek on average as smooth sailing. And that is the real meaning of “55,000” lemmata here. Of course, that’s an eight page average across a corpus that is still not terribly homogeneous; and some bits of the corpus are going to be understood a lot better than others. The lemmatiser understands all 199,000 word instances in Homer, for instance: 400 pages by our reckoning, not just 8. On the other hand, the Strictly Classical corpus also includes Aeschylus, whose transmission has been corrupted frequently, and where the lemmatiser falls over 63 word instances of 74,000—once every couple of pages.
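The conversions between percentages, "1 in N" ratios and pages are simple enough to sketch in Python. The 500 words per page figure is an assumption read off the Homer reckoning (199,000 instances as 400 pages), and the function names are made up for illustration. Because the percentages in the tables are rounded, ratios recomputed from them will only land near the quoted ones.

```python
# Assumption: about 500 word instances per page (199,000 instances ~ 400 pages).
WORDS_PER_PAGE = 500


def one_in_n(pct_recognised: float) -> float:
    """Return N such that one item in every N goes unrecognised."""
    missed = 1.0 - pct_recognised / 100.0
    if missed <= 0:
        raise ValueError("everything is recognised, so there is no finite ratio")
    return 1.0 / missed


def pages_between_misses(instances_per_miss: float,
                         words_per_page: int = WORDS_PER_PAGE) -> float:
    """Convert 'one unrecognised instance per N instances' into pages of text."""
    return instances_per_miss / words_per_page


if __name__ == "__main__":
    # Strictly Classical, all word instances: one miss per 2993 instances.
    print(round(pages_between_misses(2993), 1))
    # Aeschylus: 63 unrecognised instances out of 74,000.
    print(round(pages_between_misses(74_000 / 63), 1))
    # Ratio recomputed from a rounded percentage (approximate by construction).
    print(round(one_in_n(99.56)))
```

The page counts are only as good as the words-per-page guess, which is why the prose sticks to "say six pages" rather than anything more precise.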
With the Mostly Pagan corpus, which sticks to literary texts up to IV AD, the lemmatiser understands the corpus almost as well: 76,000 lemmata give you all but 1 in 172 word forms, and in fact because the later texts are slightly more homogeneous linguistically, almost 10 pages on average of text before there is a problem word. So 76,000 lemmata for Mostly Pagan is about as meaningful a claim as 55,000 lemmata for Strictly Classical: it lets you understand almost the same proportion of text in the corpus. There’s bound to be more lemmata than that in the corpus, that the dictionaries have not officially recorded; but it’s not going to be overwhelmingly more. I’d guessed maybe 500 lemmata underestimated for the Strictly Classical corpus, with 1,500 unrecognised word forms. The Mostly Pagan corpus has 5,000 unrecognised word forms, so I’ll guess maybe 2,000 underestimated lemmata.
The LSJ corpus is much less well understood, partly because it includes technical writing, but mostly because it includes the more unruly texts from the inscriptions and papyri, with their distinct vocabulary and grammars, and confusing spellings. We claimed 124,000 lemmata here, but that only gets you one word form unrecognised per 23; including potential proper names, it’s as bad as one word unrecognised in ten. And you’ll be stumbling over one word per page and a bit. Our unrecognised word forms are now up to 35,000 lowercase forms. That does not necessarily mean 10,000 more lemmata unaccounted for, given the problems in spelling and grammar; so I’m reluctant to guess how many more lemmata you need to get to the same level of recognition as with the Strictly Classical corpus. But there are clearly more lemmata to go.
You can see the trouble the papyri and inscriptions bring more clearly in the last two counts, which include and exclude them. Without them, the TLG corpus has one lowercase word form in 23 unrecognised, and a little over a word per two pages unrecognised. That’s not that bad for the claimed 162,000 lemmata, given the bewildering diversity of texts in the corpus. Let the inscriptions and papyri back in, and you now miss a word form for every 18, and a word every one and a half pages. And that’s for increasing the size of the corpus by just a twentieth.
So the lemma counts are more and less reliable for different periods of Greek: we can tell how much text they allow you to recognise in different corpora, and we can allow that there are cut-offs for how many lemmata it is useful to know in a corpus. The lemma count is still not open-ended, so long as the corpus is finite. (That’s the thing about langue instead of parole: the corpus size of *potential* text, using language as a theoretical system, is infinite.) And the word form coverage of the lemmatiser will keep improving, as an ongoing project; as I’d already mentioned before, TLG word form recognition has gone up from 90% to 94% in the past two years. But the lemma count does peter out.
So let me give one last batch of numbers to illustrate the relativity of lemma counts: how much less of a corpus do we understand, if we cut down on the number of lemmata? I’ll do that using the word instances per lemma count for the TLG. Because there is a fair bit of ambiguity in Greek morphology, many word forms are ambiguous between two lemmata (and a few between more than two); so there is some double counting of instances to be had. As a result, the 202,000 lemmata recognised in the TLG corpus—proper names and not—account for 112 million word instances, though the corpus really contains only 95 million.
So if we take 112 million as our baseline, how many instances are accounted for by admitting fewer lemmata?
Lemmata | Word Instances | % of Word Instances |
---|---|---|
100 | 61,166,253 | 54.44% |
500 | 78,932,451 | 70.25% |
1,000 | 86,575,286 | 77.06% |
2,000 | 94,016,671 | 83.68% |
5,000 | 102,370,243 | 91.11% |
10,000 | 106,926,324 | 95.17% |
20,000 | 109,884,248 | 97.80% |
50,000 | 111,727,544 | 99.44% |
100,000 | 112,191,095 | 99.85% |
120,000 | 112,251,181 | 99.91% |
150,000 | 112,302,895 | 99.95% |
180,000 | 112,332,895 | 99.98% |
190,000 | 112,342,895 | 99.98% |
202,000 | 112,354,703 | 100% |
There’s your Zipf’s Law in action. The table neatly parallels what Wikipedia says for vocabulary size, quoting a 1989 paper presumably on English: “We need to understand about 95% of a text in order to gain close to full understanding and it looks like one needs to know more than 10,000 words for that.”
The difference between 100,000 and 200,000 lemmata accounts for just 163,000 word instances out of 112 million, around one word in 700. The difference between 180,000 and 200,000 accounts for less than a word every ten pages. So there’s a very very long tail of increasingly rare words: the last 60,000 lemmata each occur just once in the 95 million word corpus, and the last 25,000 lemmata before that occur just twice. There’s a *lot* of these one-offs, which is why all together they account for an unknown word every four pages. And we need dictionaries for words we don’t come across every day, not words we do.
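A table like the one above falls out of a cumulative sum over lemma frequencies sorted from most to least frequent. Here is a minimal sketch of that calculation in Python; the function name and the toy counts are hypothetical, and it assumes you already have one instance count per lemma (double counting of ambiguous forms included, as in the 112 million baseline).

```python
from itertools import accumulate


def coverage_by_rank(lemma_counts, cutoffs):
    """Share of all word instances covered by the top-N most frequent lemmata.

    lemma_counts: iterable of instance counts, one per lemma (any order).
    cutoffs: iterable of N values; an N beyond the lemma total covers everything.
    """
    ranked = sorted(lemma_counts, reverse=True)
    running = list(accumulate(ranked))  # running[i] = instances in top i+1 lemmata
    total = running[-1]
    return {n: running[min(n, len(running)) - 1] / total for n in cutoffs}


if __name__ == "__main__":
    # Toy frequencies with a long tail of one-offs, purely for illustration.
    toy_counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
    for n, share in coverage_by_rank(toy_counts, [1, 3, 10]).items():
        print(f"top {n:>2} lemmata: {share:.0%}")

    # Sanity check against the first row of the table: 100 lemmata.
    print(f"{61_166_253 / 112_354_703:.2%}")
```

The long tail shows up in the same way on the toy data: the last few lemmata add almost nothing to the running total.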
Still, they are one-offs (hapaxes). They’re not useless—they were clearly useful to whoever used them that one time in the 2,500 year span of the corpus. But no one needs all 60,000 of them at once. And by the time you’re down to lemmata that happen just once or twice in a roomful of books (a small room admittedly), you can appreciate why real human beings walk around with close to 20,000 lemmata in their skulls, and not 200,000. For the rest, we have guessing from context (and related words); and we have dictionaries. And once Classical Greek became a bookish language, the Byzantines used dictionaries too.
Lerna VIIa: Classical and Late vocabulary
Here, I’ll try making some sense of how the vocabularies of Greek have shifted between the corpora.
This is where we got to.
Corpus | Scope | Lemmata | Excluding Proper Names |
---|---|---|---|
TLG + PHI #7 | (viii-XVI, +tech +christ +inscr/pap) | 214,381 | 172,646 |
TLG | (viii–XVI, +tech +christ -inscr/pap) | 201,823 | 162,009 |
LSJ Corpus | (viii-VI, +tech -christ +inscr/pap) | 159,636 | 124,215 |
Mostly Pagan | (viii–IV, -tech -christ -inscr/pap) | 99,485 | 76,067 |
Strictly Ancient | (viii–iv, +tech -christ +inscr/pap) | 66,390 | 54,898 |
The corpora have varying mixes of including “technical” texts, Christian texts, and inscriptions and papyri. In case it wasn’t obvious, “Christian texts” means texts about the Christian religion, which have a distinct editorial and linguistic tradition deviant from the Classics. We’re not banning authors for their creed, but for what corpora their texts fit into. The Mostly Pagan corpus chooses to end with Synesius rather than Nonnus, but both of them started as pagans and ended as Christian bishops.
I’m going to try and work on how the vocabularies differ in time between the corpora. Two postings ago, we cut down on post-classical and suspect-looking analyses, by restricting our word form counts to forms of good standing and pedigree. This permitted us to describe a more homogeneous corpus. We can put the same restrictions on our lemma counts.
Corpus | Lemmata | Excluding Proper Names |
---|---|---|
TLG + PHI #7 | 204,393 | 167,640 |
TLG (viii–XVI) | 192,342 | 157,302 |
LSJ Corpus (viii-VI) | 151,962 | 120,018 |
Mostly Pagan (viii–IV) | 97,906 | 75,845 |
Strictly Ancient (viii–iv) | 65,842 | 54,743 |
Forms of Good Standing (no numbers, hypothetical, hypercorrect, unattested tenses, uncertain inflection, anomalous inflection, transliterated Latin)
Nary a dent on the Strictly Classical corpus or even the Mostly Pagan corpus, which contain literature. But we’ve got rid of words made up by grammarians as etymologies; e.g. ἀόλλησις “thronging” as an etymology of ἀλλᾶς “sausage”. (The jokes just write themselves, don’t they.) We also got rid of Latin terminology, which didn’t always make it to the dictionaries when it was undigested Latin; e.g. νερεδιτάς hereditas “inheritance”. What with that and Milesian numbers, we can take 10,000 lemmata out for the overall corpus.
hereditas ends up as /nereðitas/? Yes, Byzantine lawmen liked transliterating Latin /h/ as <n>; I’m not clear on why.
Now comes the hatchet. How many lemmata can be called Ancient grammatically, even if they only show up in later texts? That sounds nonsensical, right? Surely if a lemma is first attested post-Classically, it counts as post-Classical. Well, it is, but I’ll crank the handle anyway.
- I’m leaving out proper names now, because the lemmatiser has only occasionally assigned them a period.
- OTOH, lemmata do by default get called Late in the lemmatiser’s database if they are unique to Lampe or Trapp, and specifically Demotic if they’re unique to Kriaras. Finding I’d missed a class of lemmata for tagging was what made me start revisiting all the counts.
- This also means that by default, if the lemma is in LSJ, it’s counted as classical.
- LSJ stops nominally at VI AD (and on occasion with scholia, a lot later); so it decidedly includes Koine—but lemmata have not been consistently periodised in the lemmatiser as Koine as distinct from Classical. So the counts of lemmata tagged as classical are inflated enough not to be useful.
- I’m cranking the handle anyway.
- Beyond that, though, any verb derived from a Classical verb (by prefixing a preposition) still counts as linguistically Classical, because that process was fully productive from the beginning. So ἀντιδιαλοιδορέομαι “to be mocked thoroughly in response” is attested only in Trapp; but all of ἀντί, διά and λοιδορέω are Classical, and the combination was licensed in antiquity, so the compound of all three is counted as Classical.
- The same goes for lemmata formed through derivational morphology—unless the word does show up in a later dictionary. So ἀβελτίωτος “unimproved” could have been formed at any time of Greek, from ἀ-, βελτιόω, and -τος. But because it is explicitly attested in Trapp, it is counted as a new Byzantine word.
- The discrepancy between how I handle prefixing (always counts as Ancient) and suffixing (counts as a new word in later dictionaries) is, as you may have guessed, an artifice of how the lemmatiser has been implemented.
- OTOH, some later text has crept into the nominally Ancient corpus—notably in Testimonia (later descriptions of authors in literature), and more so in “technical” texts, which were often written in Koine. In fact, LSJ has plenty of Koine in it—and as we’ll see, a lot more Koine in technical and daily-life texts than in literary texts, something which should surprise precisely no one.
- And yet… I’m still cranking the handle.
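To make the tagging rules in that list concrete, here is a toy sketch of them in Python. It is emphatically not the lemmatiser's actual code: the class, the field names, the order in which the rules are tried, and the handling of a lemma found in more than one later dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Later dictionaries named in the rules above.
LATER_DICTIONARIES = frozenset({"Lampe", "Trapp", "Kriaras"})


@dataclass(frozen=True)
class Lemma:
    form: str
    dictionaries: frozenset                 # dictionaries attesting the lemma
    prefixed_classical_verb: bool = False   # preposition + Classical verb
    derived_from_classical: bool = False    # built by derivational morphology


def period(lemma: Lemma) -> str:
    """Assign a period following the rules as described (hypothetical ordering)."""
    if lemma.prefixed_classical_verb:
        return "Classical"   # prefixing was productive from the beginning
    if "LSJ" in lemma.dictionaries:
        return "Classical"   # default for anything in LSJ
    later = lemma.dictionaries & LATER_DICTIONARIES
    if later == {"Kriaras"}:
        return "Demotic"     # unique to Kriaras
    if later:
        return "Late"        # attested in Lampe or Trapp but not LSJ
    if lemma.derived_from_classical:
        return "Classical"   # derived, and in no later dictionary
    return "Unknown"


if __name__ == "__main__":
    examples = [
        Lemma("ἀντιδιαλοιδορέομαι", frozenset({"Trapp"}), prefixed_classical_verb=True),
        Lemma("ἀβελτίωτος", frozenset({"Trapp"}), derived_from_classical=True),
    ]
    for lemma in examples:
        print(lemma.form, period(lemma))
```

The two examples reproduce the asymmetry admitted above: the prefixed verb comes out Classical even though it is attested only in Trapp, while the suffixed adjective comes out Late for exactly that reason.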
So if I crank the handle, and exclude analyses that the lemmatiser thinks, for better or worse, are post-Classical, what do I get?
Corpus | Lemmata |
---|---|
TLG + PHI #7 | 132,098 |
TLG (viii–XVI) | 122,579 |
LSJ Corpus (viii-VI) | 110,417 |
Mostly Pagan (viii–IV) | 73,260 |
Strictly Ancient (viii–iv) | 54,176 |
Forms of Good Standing and Classical Pedigree
Let’s try and make sense of this. The two corpora we’ll compare are the LSJ corpus, which goes up to VI AD and excludes Christian writings; and the complete TLG + PHI #7 corpus. As always, excluding proper names:
Corpus | All lemmata | Classical lemmata only | Difference |
---|---|---|---|
TLG + PHI #7 | 167,640 | 132,098 | 33,000 Middle + 2,000 Modern lemmata |
LSJ Corpus | 120,018 | 110,417 | 10,000 Middle lemmata |
- There are 47,000 lemmata that turn up only after VI AD, after the LSJ corpus.
- There are another 10,000 lemmata in the LSJ corpus (8%) that are marked as late (“Middle”). Given that the LSJ corpus does include Koine texts, that whether a lemma got marked as Koine or not is a little haphazard, and that the technical Koine texts in the LSJ are linguistically messy, that’s not that surprising.
- By contrast, the Mostly Pagan corpus, which skips the papyri and technical texts, has just 2,500 middle lemmata (3%). Literary texts avoid linguistically innovative lemmata. Technical texts account for another 2,600 middle lemmata; the remaining 5,000 are from papyri.
- Of the 47,000, almost half—22,000—are linguistically still Classical. Some of these are late lemmata that just happen to have made it to LSJ. (A lot of those are legal Latinisms.) Some of those are derived lemmata.
Restating: a quarter of all lemmata in our corpus turned up only after VI AD; but half of those new lemmata don’t look new to the lemmatiser at all: they look classical. Because of productivity of lemmata vs. accidents of tagging, somewhat less than a quarter of all lemmata in our corpus appear to be post-Classical to the lemmatiser.
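The rounded figures in that restatement can be checked against the table with a few subtractions. A quick sketch in Python, using only the four counts quoted above (the variable names are mine):

```python
# Lemma counts excluding proper names, as given in the table.
all_lemmata = {"TLG + PHI #7": 167_640, "LSJ Corpus": 120_018}
classical_only = {"TLG + PHI #7": 132_098, "LSJ Corpus": 110_417}

# Lemmata that turn up only after the LSJ corpus ends (the "47,000").
new_after_lsj = all_lemmata["TLG + PHI #7"] - all_lemmata["LSJ Corpus"]

# Of those, the ones the lemmatiser still tags as Classical (the "22,000").
new_but_classical = classical_only["TLG + PHI #7"] - classical_only["LSJ Corpus"]

# Everything in the full corpus tagged post-Classical (Middle plus Modern).
post_classical = all_lemmata["TLG + PHI #7"] - classical_only["TLG + PHI #7"]

total = all_lemmata["TLG + PHI #7"]
print(new_after_lsj, f"{new_after_lsj / total:.1%}")                   # 47622 28.4%
print(new_but_classical, f"{new_but_classical / new_after_lsj:.1%}")   # 21681 45.5%
print(post_classical, f"{post_classical / total:.1%}")                 # 35542 21.2%
```

So "a quarter" is really a bit over a quarter, "half" is a bit under half, and "somewhat less than a quarter" is about a fifth.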
Let’s look at these new lemmata more closely, by looking at the most frequent lemmata in each category. The distinction between “linguistically Classical” and “linguistically Middle” does not turn out to matter much, because it’s an accident of what has been included or excluded from LSJ. OTOH the distinction with “linguistically Modern” (i.e. Early Modern Greek) is quite revealing. Be warned too that the frequency of lemmata is all about the types of text included in the corpus.
And because a little bit of vernacular has leaked into Photius’ lexicon, which was after all compiled pretty late, I’m excluding the lexica from the LSJ counts again. This pushes 15,000 lemmata back into the mediaeval period—140 of them linguistically Modern, 1500 of them linguistically Middle—and the rest of them linguistically Ancient. They’re dictionaries; of course they have lots of one-off words.
So what are the most frequent lemmata new to mediaeval Greek?
Linguistically Ancient (34,641)
- ἐκπόρευσις (85) “proceeding forth”
- στιχολογία (735) “recitation”
- περιβόλιον (561) “garden”
- ἐναντιοφανής (533) “apparently contradictory”
- πανσέβαστος (453) “most august”
- λατινικός (388) “Latin”
- πακτεύω (375) “make a pact”
- κορμίον (365) “trunk of body”
- χρυσορρήμων (365) “golden-speaking”
- παραταγή (365) “order for payment”
Linguistically Middle (25,111)
- θεοτοκίον (1425) “hymn to Virgin Mary”
- μετόχιον (1370) “monastic property”
- κονδικτίκιος (928) “relating to repossession of property”
- κανείς (671) “no one”
- ὀκτώηχος (863) “hymnal with all eight modes”
- πρωτοσπαθάριος (696) “chief of imperial bodyguard”
- ἱεραρχία (638) “hierarchy”
- τώρα (622) “now”
- γυρίζω (582) “I return”
- μισέρ (556) “monsieur”
Linguistically Modern (2,526)
- ἔτζι (749) “so”
- ἠμπορέω (424) “I can”
- ἀμή (290) “but”
- τέτοιος (261) “such”
- κάθε (202) “each”
- ἀντάμα (187) “together”
- ἀπαυτοῦ (125) “thence”
- κάποιος (114) “someone”
- βουλέω (113) “I want”
- ὁλόρθος (111) “upright”
(A couple of texts nominally in LSJ’s time period still contain νά, which I treat as diagnostic of Modern Greek; but again, these lists are only meant to be indicative.)
The ancient-looking new words deal with theology, logic, law, or the public sphere: disciplines which kept innovating their own specialist vocabulary. The middle-looking words largely deal with the church: the theotokion count is inflated relative to its counterparts because the word is used to signal a section for a lot of hymns in the corpus. There are a couple of novel grammatical words in middle Greek (“no one, now”). But for Modern Greek all the most frequent words are grammatical. And what that foreshadows is that Modern Greek has a grammatical system distinct from that of Ancient Greek, while Middle Greek is a lot closer to the Ancient grammatical system. That’s no surprise, given that much of our Middle Greek corpus is Atticist to begin with.
Finally, for what it’s worth, this is how many lemmata the lemmatiser thinks are linguistically Attic:
Corpus | Lemmata |
---|---|
TLG + PHI #7 | 127,169 |
TLG (viii–XVI) | 118,697 |
LSJ Corpus (viii-VI) | 105,960 |
Mostly Pagan (viii–IV) | 70,726 |
Strictly Ancient (viii–iv) | 51,666 |
Forms of Good Standing and Attic Pedigree
That’s not worth that much, it must be said, since Attic is taken as the default dialect in the lemmatiser. Though excluding dialect word forms eliminates a substantial number of forms from the corpus, the lemma count itself is not affected much: an Attic-compatible word form usually turns up someplace.