Ἡλληνιστεύκοντος

Lerna VId: A correction of lemma counts

By: Nick Nicholas | Post date: 2009-07-10
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Byzantine Greek, Lerna, lexicography, lexicon, TLG

Last post had its share of egg on my face, showing systematic overcounts of word forms in the corpora. This post is another healthy serving of omelette, correcting the lemma counts given in Lerna VIa. The overall story is:

There are fewer distinct word forms in the PHI #7 corpus than I thought
There are fewer scribal alternate forms left in PHI #7: if an editor thought they knew better than the scribe, the scribe’s form is left out of consideration
There is less dialectal and orthographic wiggle-room allowed to PHI #7
So as a result of all this, the count of lemmata distinctive to PHI #7 has crashed: ignoring proper names, 3,800 lemmata that the lemmatiser thought it saw in PHI #7 are no longer there.
The count has still crashed, even though I’ve added a fair few lemmata to deal with PHI #7—the most frequent names, the overlaps with Trapp’s dictionary, a few stragglers from DGE—as well as some dialectal grammar and some more respelling rules. I’ve picked up around 800 non-names and 1200 proper names; so I’m down by 1800 lemmata from before, rather than 3800.
I could have kept going to add more names than that, but it’s been two weeks already, for gorsakes.
OTOH, because I’ve added extra names in particular, recognition of the TLG has slightly improved. So there are a few more lemmata for just the TLG-based corpora. (*Very* few.)
I also did some debugging of orthographic variation in lemmata, which resulted in some conflation of variants.
So if you ignore proper names, the TLG lemma count… actually ended up losing a few lemmata. (Again, *very* few: a couple of hundred lemmata each way.)

So.

Corpus	Lemmata (previous → corrected)	Excluding Greek Numerals	Excluding Proper Names
TLG + PHI #7	216,234 → 214,381	211,794 → 209,952	175,791 → 172,646
TLG (viii–XVI)	201,680 → 201,823	197,448 → 197,591	162,219 → 162,009
LSJ (viii–VI)	159,636	156,720	124,215
Mostly Pagan (viii–IV)	99,426 → 99,485	98,593 → 98,652	76,145 → 76,067
Strictly Ancient (viii–iv)	66,437 → 66,390	66,078 → 66,031	55,003 → 54,898

I also had a tally including also-ran analyses:

Corpus	Lemmata (previous → corrected)
TLG + PHI #7	220,560 → 218,727
TLG (viii–XVI)	206,161 → 206,470
LSJ (viii–VI)	166,387
Mostly Pagan (viii–IV)	107,257 → 107,512
Strictly Ancient (viii–iv)	73,427 → 73,532

In all of this, I’ve not been paying the PHI #7 corpus that much attention, though I did make a point of slipping it into the LSJ corpus. (The LSJ coverage of inscriptions and papyri is in fact why I called up PHI #7 in the first place.) I knew there would be extra lemmata there, and this lemma count is the PHI #7 disc’s chance to shine. PHI #7 has added 6.5% more word instances to the TLG’s, but 16% more word forms, and 6% more lemmata! That’s phenomenal!

… What on Earth am I talking about? Remember Zipf’s Law: a word form’s frequency is roughly inversely proportional to its frequency rank, so most distinct forms are rare ones. It’s a Long Tail. If you add 6% more word instances, by the time you’re already at 95 million instances, you should be getting… well, I can’t do the maths, but you should be getting at most hundreds of new lemmata, not (as the table above shows) 12,000, of which only a couple of thousand are proper names. The 10,000 more lemmata of ordinary vocabulary show you that the inscriptions and papyri (the Greek of daily life and of far-flung dialects) have a very different vocabulary from the Greek of literature.

Of course, that you get 16% more word forms in PHI #7 means there’s a lot of different inflections in the corpus that lie outside the TLG’s ambit, because of all the non-literary dialects represented in the inscriptions. It also means a lot of misspellings that didn’t belong in the TLG, as well.
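The three percentages quoted above can be rechecked from the corrected tables. This is a minimal Python sketch, not anything from the lemmatiser itself: the lemma counts come from the table above, the word instance and word form counts from the raw-string table in the word form correction post (Lerna VIc), and the names `tlg`, `tlg_phi7` and `percent_increase` are made up for the illustration.

```python
# Corrected counts for the TLG alone and for TLG + PHI #7.
tlg = {"instances": 95_475_128, "forms": 1_567_892, "lemmata": 201_823}
tlg_phi7 = {"instances": 101_684_658, "forms": 1_815_540, "lemmata": 214_381}


def percent_increase(before, after):
    """Relative growth from `before` to `after`, as a percentage."""
    return 100.0 * (after - before) / before


# How much PHI #7 adds on top of the TLG, per measure.
growth = {key: percent_increase(tlg[key], tlg_phi7[key]) for key in tlg}

for key, pct in growth.items():
    print(f"{key}: +{pct:.1f}%")
```

The lemma figure here uses the full lemma column, proper names and numerals included; restricting it to the "Excluding Proper Names" column gives the 10,000-odd ordinary lemmata discussed above.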

In VIa, I went into an extended riff extrapolating how many more lemmata of Greek could turn up. Let me attempt that again, this time with more detail on proper names—but *not* including proper names in the final estimate.

The reason proper names don’t belong in a final tally is worth restating, because not enough people are laughing at the notion. When we want to know how many words of English there are (which we shouldn’t, but I’ve already been through that), we don’t add the New York State White Pages to the Oxford English Dictionary, and we don’t start screen-scraping geonames.org. We recognise that proper names are a different kind of thing from normal words (although the boundaries are fuzzy); and we also recognise that it’s problematic to say a name belongs to one language and not another.

Does Κόρινθος count as a Greek name, even though it has the prehellenic telltale -νθ-? Well sure it does. Does Ομπάμα count as a Greek name? Or Σαίξπηρ for Shakespeare? Surely not. But what about the older declinable transliteration Σακεσπήριος? Doesn’t that at least look Greek? What about Αὐρήλιος? But then again, what about Ἰσαάκ? Is Αμπντουλάχ not a Greek name? But does it become a Greek name when it was hellenised, as the Byzantines did, as Ἀβδελλᾶς? And is counting these names as part of the vocabulary of Greek a meaningful thing to do?

Well, better not to count proper names in the final tally at all; but let me add the counts I do know of, just in case someone is curious.

Right now, the TLG lemmatiser knows about almost 42,000 proper names. That includes most names of the Strictly Classical canon; a fair few names from later literature (including lots of Byzantine surnames), the names in Smith’s Dictionary of Greek and Roman Geography, and the thousand-odd names I was shovelling in over the past fortnight, to deal with the inscriptions and papyri.
Pape-Benseler went into its second edition in 1863, which increased it by a third. It covers geographical, personal, and mythological names in Ancient literature, and has some coverage of later stages. It has good coverage of such inscriptions as were known at the time, and is starting to notice papyri—though remember, this is thirty years before the discovery of Oxyrhynchus. And the dictionary is reasonably good about conflating variants.
Benseler does not say how many names he has in total, but he does say that Alpha under his revision went from 3820 names to 6120. Extrapolating based on LSJ, that should mean 38,000 names overall. There are clearly lemmata in Pape-Benseler that aren’t in the TLG lemmatiser: I added 500 names because of dealing with PHI #7, and that was only dealing with names occurring 10 times or more in PHI #7. How much more am I missing? No idea. But I’d be surprised if it was more than 10,000.
In the following, I need a sense of how many of these names are personal, and how many are geographical. The Heidelberg word lists for papyri are a bit more reluctant to conflate variants than I prefer, but at least they list personal and geographical names separately: 8838 personal, 2637 geographical. Good enough for me: I’ll say the ratio of personal names to place names is 4:1.
1863 is a long time ago in epigraphy, and the Lexicon of Greek Proper Names has been running for the past three decades to record the torrent of names found on inscriptions. It avoids mythological names (which are covered well enough in literature and Pape-Benseler), and it also does not do geographical names. It’s ongoing, but its online search knows of 35,000 distinct names of people (whereas Pape-Benseler has 38,000 names of people, places, and gods). Now, the TLG lemmatiser recognises 17,600 distinct names, personal and geographical, in the ancient inscriptions on the PHI #7 disc. Guessing that 14,000 of those are personal names (4:1 ratio), that means it’s missing at least 21,000 personal names.
The Leuven projects recognise 16,000 personal names in the papyri (with 7,000 extra variants), using the Duke Documentary Papyri corpus. The TLG lemmatiser recognises 9,600 distinct names, personal and geographical, in the same corpus on PHI #7. Guessing that 7,700 of those names are personal, it’s missing at least another 8,000 names.
Some of the Leuven names will overlap with LGPN; but the Egyptian names won’t. Let’s say that all up, we’re owed at least another 27,000 personal names. And using that 4:1 ratio again, another 5,000 place names. Heidelberg counts 9,000 personal names to Leuven’s 16,000, and Heidelberg counts 2,600 geographical names; extrapolating up, that’s consistent with 5,000.
That’s not even scratching the surface of Byzantine and Modern names (let alone Σαίξπηρ or Ομπάμα, or the Thessalonica and Environs phone book). But so far, we can guess 42+27+5=74,000 names.
Flipping things around, there are 72,000 unrecognised capitalised words in PHI #7. That does not mean 72,000 missing names: lots of these will be misspellings of known names that the lemmatiser isn’t dealing with, or different inflections of the same name. And those names are in the scope of LGPN and Leuven. I’d say the personal names are already accounted for in the 27,000 (say) personal names of the two initiatives.
There are a further 42,000 unrecognised capitalised words in TLG. Most of these won’t be in LGPN and Leuven, though some will be in Pape-Benseler. Most of these by far are from post-Classical texts, and they include ancient gazetteers. (Ptolemy’s Geography alone accounts for close to 3,000 unrecognised names.) How many of these are legitimate novel proper names? Again, no idea, but by this stage we’re getting into one-offs, because all proper name word forms occurring more than 7 times in the TLG have been added to the database. I’ll guess 30,000. There’ll be some overlap with Leuven and LGPN, but not a lot, because many of these names are Byzantine.
As mentioned, the TLG is maybe 70%, maybe 75% complete for Byzantine literature, and only starting to go into Early Modern literature. It does have a lot of Byzantine surnames through church deeds (which account for 5,000 unrecognised capitalised words); so it’ll have a reasonable cross-section. I haven’t gone through the Byzantine prosopographies though (285-641, 642-1265, Palaeologan), to work out how many surnames they’ve unearthed in sum.
And I have not spent quality time with the Attica or Thessalonica or Nicosia phonebooks.
So at least 70,000 proper names to go, adding up to something like 110,000 proper names, and that count only goes up to the Fall of Constantinople.
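The back-of-envelope arithmetic in the list above can be laid out step by step. This is only a sketch of the guesses as stated in the text; the variable names and the `round_to` helper are invented for the illustration, and the 27,000, 5,000 and 30,000 figures are taken as given rather than derived.

```python
PERSONAL_SHARE = 4 / 5  # the 4:1 personal-to-place ratio assumed in the text


def round_to(value, step):
    """Round to the nearest multiple of `step`, as the text does informally."""
    return int(round(value / step) * step)


# Inscriptions: LGPN's 35,000 personal names vs. names recognised in PHI #7.
inscr_personal = round_to(17_600 * PERSONAL_SHARE, 1_000)
missing_vs_lgpn = 35_000 - inscr_personal

# Papyri: Leuven's 16,000 personal names vs. names recognised in the papyri.
pap_personal = round_to(9_600 * PERSONAL_SHARE, 100)
missing_vs_leuven = 16_000 - pap_personal  # the text says "at least another 8,000"

# Guesses taken directly from the text.
known_to_lemmatiser = 42_000
owed_personal = 27_000   # LGPN + Leuven, allowing for overlap
owed_places = 5_000
tlg_unrecognised_guess = 30_000

subtotal = known_to_lemmatiser + owed_personal + owed_places
with_tlg_guess = subtotal + tlg_unrecognised_guess

print(inscr_personal, missing_vs_lgpn, pap_personal, missing_vs_leuven)
print(subtotal, with_tlg_guess)
```

The itemised guesses sum to 104,000; the "at least 70,000 to go" and "something like 110,000" in the text are rounder and a little larger, which leaves room for the Byzantine sources that have not been counted.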

Anyone who wants to start boasting of the 110,000 proper names of Two And A Half Thousand Years of Greek needs to be smacked upside the head with all three volumes of the Dictionary of American Family Names, and have the printout of all 8,000,000 places on geonames.org dropped on their foot. Because all of those count as proper names of One Year of English, by the same criterion.

(The Blogger Writing These Lines enjoyed contributing to the Dictionary of American Family Names, even before he realised its value as a tool of percussive persuasion.)

So. Banishing proper names, we’re left with 173,000 lemmata, as guesstimated. How much is left to go again? As it turns out, I’m doing the same guesstimates as before—but they make more sense without including proper names:

I keep my guesstimate of 20,000 lemmata more from Trapp (including texts not yet added to the TLG and volumes not yet published), and 10,000 lemmata more from Kriaras (ditto). That’s 203,000.
There are words in LSJ that are not represented in this corpus. The biggest gap is the mediaeval Latin-Greek glossaries, with 1,000 missing lemmata; but there are several other oddities. The latest I’ve encountered, under ἐλεφαντουργική “of or pertaining to ivory-working”: the 1161 AD commentary to the astrologer Paul of Alexandria, writing in 378—and last published in 1588. (The irony here is, the same adjective turns up in the rather more mainstream Heliodorus, a century beforehand.) But again, once the PHI #7 texts are in, and with the changes in text editions between the original LSJ and the TLG—not to mention the rejected scribal forms—I don’t think there’s more than 3,000 lemmata to add. That takes us to 206,000.
I’m inclined to revise my extrapolation for DGE downwards. Volume I updated may have 3500 lemmata not in LSJ, but it’s competing not only with Bauer, Lampe, and Trapp, but also with the LSJ Supplement—which on its own adds 10,000 lemmata to LSJ, and which also has made a point of covering more inscriptions and papyri. I haven’t taken the time to do any counting with DGE. It’s a long plane trip tomorrow to Montreal—so maybe I will.
But there’s no way Volume I has 3,500 lemmata not also in LSJ/Bauer/Lampe/Trapp/LSJSupp. DGE looks like taking 20 volumes if and when it finishes. (I wasn’t planning on living until 2100 AD to find out.) If there’s just 500 novel lemmata in Volume I, that means 10,000 novel lemmata all up; if 1000, then 20,000, as I proposed last time. I’m feeling jaundiced, but I’ll still give them 20,000. That takes us to 226,000 lemmata, up to the fall of Candia.
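The running total in the list above is simple addition, sketched here so the steps are easy to check. The labels and variable names are mine; every number is a guesstimate from the text, and the DGE line uses the 20 volumes at 1,000 novel lemmata each that the text settles on.

```python
DGE_VOLUMES = 20

# (source, extra lemmata guessed in the text)
steps = [
    ("Trapp", 20_000),
    ("Kriaras", 10_000),
    ("LSJ words missing from the corpus", 3_000),
    ("DGE", DGE_VOLUMES * 1_000),
]

running = 173_000  # lemmata excluding proper names, as guesstimated above
totals = []
for source, extra in steps:
    running += extra
    totals.append(running)
    print(f"+ {source}: {running:,}")

# The more jaundiced DGE scenario: 500 novel lemmata per volume.
dge_low = DGE_VOLUMES * 500
```

With the lower DGE figure the final total would be 216,000 rather than 226,000.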

Ούφ. On those figures, English still wins, 🙂 though not by much. The level of precision I’ve given is of course illusory, and in a following post I will tackle what is a more sensible question: how much vocabulary do you need to recognise n% of a text. But these counts should at least be indicative.

Lerna VIc: A correction of word form counts

By: Nick Nicholas | Post date: 2009-07-06
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Byzantine Greek, Lerna, lexicon, TLG

This post fixes counts given in Lerna Va and Lerna Vb, with corrected counts from the PHI #7 disc—and a couple of weeks’ work on the archaic dialects and proper names of the PHI #7 corpus. I’ve also fixed several errors in how I was counting forms as unique. The end result is that the previous counts were inflated all up by 15%.

This post is boring—a bunch of numbers—but necessary for the record. Because the counts are dependent on how the lemmatiser recognises words, and the lemmatiser is not static (and neither is the corpus), these counts are not definitive; but they are more correct than the last reports. The major bug fix (as far as I can tell!) was that I’d forgotten to factor out case and accentuation for not only the raw word forms, but also their normalised counterparts; so all the normalised word counts were off by some 10%. But because the main conclusions were comparing vocabulary sizes relative to each other, they still hold.

That’s why testing is a good thing, right?

I’ve added a new corpus into the mix: the LSJ Corpus is meant to approximate the coverage of LSJ. It excludes any Christian-related writing, apart from the Scriptures themselves. Otherwise (and that’s a big Otherwise), it includes all pagan authors up to VI AD, including technical authors. It also includes the ancient inscriptions and papyri from PHI #7. The LSJ corpus additionally includes the lexica of Hesychius, Photius, and the Etymologicum Magnum, which were written later, but (some of the time) reach back earlier. It still leaves out the scholia on Classical literature, which explain Ancient texts with Byzantine words.

It also leaves out two Demotic texts which have ended up in collections of Ancient authors in the TLG, one under Pseudo-Hippocrates, one under the Hippiatrica. I’m taking those out of the Mostly Pagan and Strictly Ancient corpora too. The fact that a clearly XVI AD text has been lumped in with a v BC corpus should give you pause: use the author dates on the TLG with caution—they apply to the authors, but not to all the spurious works included under the author’s name.

Lerna Va

Counts of unique strings in the corpora

Corpus	Coverage	Word Instances (previous → corrected)	Word Forms (previous → corrected)
TLG + PHI #7	(viii–XVI, +tech +christ +inscr/pap)	102,005,245 → 101,684,658	1,861,358 → 1,815,540
TLG (viii–XVI)	(viii–XVI, +tech +christ -inscr/pap)	95,475,128	1,567,892
LSJ Corpus (viii–VI)	(viii–VI, +tech -christ +inscr/pap)	34,746,312	1,147,454
Mostly Pagan (viii–IV)	(viii–IV, -tech -christ -inscr/pap)	16,312,159	605,335
Strictly Ancient (viii–iv)	(viii–iv, +tech -christ +inscr/pap)	5,464,913 → 5,463,292	334,428 → 334,187

This is where the differences start. By correcting incomplete word indications, hyphenation, and rejected scribal forms in PHI #7, I’ve lost 400,000 word instances, and 46,000 distinct word forms.

You can also see that going from Mostly Pagan to the LSJ corpus almost doubles the count of distinct word forms. That’s adding in two more centuries of pagan literature, technical writing, inscriptions, papyri, and late lexica. The PHI #7 texts account for around a third of that increase; the rest comes from the technical writing and lexica. The lexica include a large number of one-off words, and a lot of loose Byzantine spelling. Technical writing includes even more loose Byzantine spelling, because these texts are not closely bound to Atticist literary norms.

But it also includes a lot of idiosyncratic vocabulary—medical, astrological, engineering, mathematical, not to mention all the random place names in Ptolemy and the other geographical texts. Technical writing also encompasses grammatical and philological commentary—which often means grammarians just making up tenses and cases to explain words. So there is a lot of distinctive vocabulary in technical writing; but there is also a lot of inflated vocabulary.

Stripping case and forms without diacritics

I’ve fixed the calculations to take out more forms with partial diacritics, so I’m now making sure that all of ανδρι, ἀνδρι and ανδρί are folded under ἀνδρί. So fewer forms from here on in are considered truly distinct:

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,649,083 → 1,545,491
TLG (viii–XVI)	1,376,016 → 1,355,062
LSJ Corpus (viii–VI)	1,001,079
Mostly Pagan (viii–IV)	562,744 → 555,843
Strictly Ancient (viii–iv)	314,887 → 312,255
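One way to picture the folding of case and partial diacritics is a comparison key with all diacritics and case removed. This is a hedged Python sketch using only the standard library, and it is blunter than what is described above: it merges every form sharing the same bare letters, whereas the counts here only fold under-marked forms into a fully accented one. The function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def fold_key(form):
    """Comparison key: decompose, drop combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFD", form)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.casefold()


def group_by_key(forms):
    """Group word forms that differ only in case or diacritics."""
    groups = defaultdict(set)
    for form in forms:
        groups[fold_key(form)].add(form)
    return dict(groups)


# The example from the text, plus a capitalised variant.
sample = ["ανδρι", "ἀνδρι", "ανδρί", "ἀνδρί", "Ἀνδρί"]
groups = group_by_key(sample)
print(len(sample), "strings,", len(groups), "distinct form(s)")
```

A real run would need to keep forms apart when the accent is the only thing distinguishing two genuinely different words, which this key cannot do.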

Restricting to recognised forms

Though I’ve added a thousand-odd proper names and some Arcadian and Cretan grammar to the lemmatiser, it still struggles with the PHI #7 corpus, as you’d expect: it’s now understanding 62% of all word forms instead of 59%. There’s 73,000 capitalised word forms in PHI #7, and 21,000 uncapitalised, that the lemmatiser has no idea about. For the TLG corpus, the equivalent is currently 42,000 capitalised word forms, and 43,000 uncapitalised, that are going unrecognised; and the TLG has seven times more word forms than PHI #7.

So there is a *lot* of vocabulary, particularly proper names, that is unique to the PHI #7 corpus, and that the lemmatiser does not yet understand. In fact, I already know there should be 16,000 distinct proper names in the papyri alone, as I mentioned last post. But once again, if I am using the lemmatiser to make morphological judgements about distinct word forms, I can’t count words that the lemmatiser doesn’t understand. So I have to pretend those words don’t exist, for any remaining counts to mean anything.

OTOH, it’s been a month, and recognition of the TLG corpus has gone up (partly because of this series of posts). The word counts are not static.

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,435,391 → 1,391,855
TLG (viii–XVI)	1,282,298 → 1,272,773
LSJ Corpus (viii–VI)	905,044
Mostly Pagan (viii–IV)	557,574 → 551,651
Strictly Ancient (viii–iv)	313,354 → 311,428

Normalisation of forms (crasis, apostrophe, respellings)

Yeah, more bugs here. I’ve been case-folding word forms up to this point; I, uh, think I forgot to case-fold the normalised word forms as well. Which ends up making quite a difference.

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,352,303 → 1,152,682
TLG (viii–XVI)	1,232,209 → 1,101,191
LSJ Corpus (viii–VI)	736,932
Mostly Pagan (viii–IV)	539,469 → 481,424
Strictly Ancient (viii–iv)	301,005 → 275,703
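The kind of bug described above, where raw forms are case-folded but their normalised counterparts are not, can be reproduced on toy data. Everything in this sketch is hypothetical: the `NORMALISED` table stands in for whatever the lemmatiser does with crasis and elision, and the entries are illustrative only.

```python
import unicodedata

# Hypothetical normalisation table: raw form -> normalised form.
NORMALISED = {
    "κἀγώ": "καὶ ἐγώ",  # crasis resolved
    "Κἀγώ": "Καὶ ἐγώ",  # the same crasis with a sentence-initial capital
    "ἀλλ’": "ἀλλά",      # elision restored
    "Ἀλλ’": "Ἀλλά",      # the same elision, capitalised
}


def count_distinct(forms, fold_normalised):
    """Count distinct normalised forms, optionally case-folding them."""
    seen = set()
    for form in forms:
        normalised = unicodedata.normalize("NFC", NORMALISED.get(form, form))
        seen.add(normalised.casefold() if fold_normalised else normalised)
    return len(seen)


tokens = list(NORMALISED) + ["ἀλλά"]
buggy = count_distinct(tokens, fold_normalised=False)
fixed = count_distinct(tokens, fold_normalised=True)
print(buggy, "distinct forms without folding,", fixed, "with folding")
```

On this toy input the unfolded count is double the folded one; on the real corpora the tables show a much smaller inflation, around 10%.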

Eliminating nu movable

Here too I changed the way I was considering a form to have nu movable: I relied on the morphological analysis rather than doing a blanket transformation. So fewer forms now get conflated.

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,307,842 → 1,125,784
TLG (viii–XVI)	1,189,688 → 1,074,767
LSJ Corpus (viii–VI)	720,855
Mostly Pagan (viii–IV)	519,498 → 470,096
Strictly Ancient (viii–iv)	289,812 → 270,115
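To see why a blanket transformation conflates more than an analysis-driven one, compare the two on a pair where the final ν is movable (ἐστίν, ἐστί) and a pair where it is not (accusative πόλιν, vocative πόλι). This is a sketch under assumptions: the blanket rule below is a guess at what such a rule might look like, and `LICENSED` stands in for the output of a morphological analyser.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def base_letter(ch):
    """Strip accents and breathings from a single precomposed character."""
    return unicodedata.normalize("NFD", ch)[0]


def strip_blanket(form):
    """Blanket rule: drop any final ν that follows ε or ι."""
    form = nfc(form)
    if len(form) > 2 and form.endswith("ν") and base_letter(form[-2]) in ("ε", "ι"):
        return form[:-1]
    return form


def strip_by_analysis(form, licensed):
    """Drop final ν only where the (hypothetical) analysis marks it as movable."""
    form = nfc(form)
    return form[:-1] if form in licensed else form


# Hypothetical analyser output: forms whose final ν is movable.
LICENSED = {nfc("ἐστίν")}

forms = ["ἐστίν", "ἐστί", "πόλιν", "πόλι"]
blanket = {strip_blanket(f) for f in forms}
analysed = {strip_by_analysis(f, LICENSED) for f in forms}
print(len(blanket), "forms after the blanket rule,", len(analysed), "after analysis")
```

The blanket rule wrongly merges πόλιν with πόλι, so it reports fewer distinct forms than the analysis-driven version.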

Eliminating non-words (abbreviations, Greek numerals, or geometric lines)

The more aggressive folding of diacritics I’ve put in means there aren’t many of these left at all.

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,300,717 → 1,125,699
TLG (viii–XVI)	1,183,120 → 1,074,683
LSJ Corpus (viii–VI)	720,800
Mostly Pagan (viii–IV)	518,321 → 470,096
Strictly Ancient (viii–iv)	289,275 → 270,093

So, sheepishly, I find that I overestimated unique word forms by say 120,000, and the errors in how I was handling PHI #7 made me overestimate by another 50,000. When I was comparing Three Thousand Years Of Greek to Slovenian and Telugu, my average word forms per thousand word instances in the TLG was 12.6; it is now 11.3. Telugu still has 30.8, so it still wins…
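The summary figures can be checked against the tables. This sketch assumes that the 11.3 figure is the final TLG word form count divided by the TLG word instance count; that is my reading of the tables rather than something stated outright, and the variable names are mine.

```python
# Final counts after eliminating non-words, from the last table above.
tlg_forms_now = 1_074_683
tlg_instances = 95_475_128

# Word forms per thousand word instances in the TLG.
per_thousand = 1000 * tlg_forms_now / tlg_instances
print(f"{per_thousand:.1f} word forms per thousand instances")

# Total overcount for TLG + PHI #7 at the same stage: previous minus corrected.
overcount_total = 1_300_717 - 1_125_699
print(f"overcount: {overcount_total:,}")
```

The total overcount comes out close to the 120,000 plus 50,000 quoted above.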

Lerna Vb

Forms of Good Standing (without: hypothetical, hypercorrect, uncertain inflection, anomalous inflection, transliterated Latin)

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,267,434 → 1,101,948
TLG (viii–XVI)	1,158,529 → 1,053,549
LSJ Corpus (viii–VI)	708,669
Mostly Pagan (viii–IV)	515,275 → 468,698
Strictly Ancient (viii–iv)	288,305 → 269,448

Forms of Good Standing and Pedigree (linguistically Classical)

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,135,915 → 980,867
TLG (viii–XVI)	1,041,520 → 938,084
LSJ Corpus (viii–VI)	676,114
Mostly Pagan (viii–IV)	505,302 → 458,756
Strictly Ancient (viii–iv)	285,856 → 266,891

Forms of Good Standing and Cecropian Pedigree (linguistically Attic)

Corpus	Word Forms (previous → corrected)
TLG + PHI #7	1,020,232 → 889,759
TLG (viii–XVI)	952,993 → 857,008
LSJ Corpus (viii–VI)	604,107
Mostly Pagan (viii–IV)	458,933 → 415,869
Strictly Ancient (viii–iv)	248,914 → 232,008

Lerna VIb: A derailing of lemma counts

By: Nick Nicholas | Post date: 2009-07-03
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Byzantine Greek, Lerna, lexicography, lexicon, TLG

You may have noticed an extended radio silence for the last couple of weeks in the series counting lemmata. The people at the Magnificent Nikos Sarantakos’ blog, where the good fight against Lerna is fought, know why: I found some problems in the way I was counting lemmata in the inscriptions and papyrus corpus (PHI #7), which I’ve been nowhere near as familiar with as the TLG corpus. As a result, I’m down 2,000-odd lemmata from where I thought I was. Because I spent lots of posts on how contingent and provisional any count of lemmata is, that should not be that big a deal: a ±1% in the lemma count is within the bounds of what can happen when you fix first-cut errors.

Still, it’s embarrassed me enough, now that people are starting to quote the Lerna VIa count of 211,794 (including Nikos Sarantakos, fighting the good fight), that I tried to get to the bottom of it. In the process, I’ve worked to treat the PHI #7 corpus less cursorily than I had done. Cleaning up problems in the PHI #7 markup, and clueing the lemmatiser in on some of the peculiarities of the dialects in the corpus, mean that the counts give a more accurate picture of what is going on with those texts. The problem is, the longer I spent fixing my handling of PHI #7, the more the lemma count fell, *even as I was busy adding lemmata from elsewhere* (DGE, Pape-Benseler, Foraboschi). Erk. The counts are more accurate (with a catch I’ll talk about), but they’re not what they were.

I’m going to air some of the dirty laundry here, to cement the point yet again that any count of lemmata is going to be unstable. After that, next post is going to revise the counts that need revising. Then, the promised posts that got derailed: how many of these lemmata count as Ancient; relating lemma counts to recognition percentages (which is the only way lemma counts are meaningful); and distinguishing word variants from lemmata.

The first issue was when I wanted to count how many lemmata should be considered Ancient. I realised I had not been counting a couple of thousand lemmata from Lampe’s and Trapp’s dictionaries (I-VIII AD and “IX-XII” AD) as post-classical. That did not particularly affect the accuracy of recognition for the TLG (as I confirmed by rerunning the program), but it was distorting the numbers: there are fewer “word forms of good pedigree” than I said there are. So you’ll get new numbers for that.

The second catch was when I found a bug in how I was extracting word forms from the PHI #7 corpus, which meant that several hyphens were being ignored, so a hyphenated word would be extracted as two separate words. Once I fixed that bug, I also noticed that some of the markers that a word was fragmentary weren’t being picked up. For instance, I knew that notation like …]atisatio[ indicated bits of a word were missing from the papyrus or inscription; I didn’t know that PHI #7 was also using dashes, like – – ]atisatio[ – –. Fixing these problems results in fewer complete word instances extracted, but of course more correct ones. Even if some lemmata that looked like being there were no longer recognised, there should be more correct long words turning up. So that should not cause any drastic drops in the size of the vocabulary.
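The two extraction fixes can be sketched on a toy string. This is not the PHI #7 markup handling itself, whose conventions I have not reproduced: it only assumes that a hyphen at a line break splits one word, that brackets or an ellipsis inside a token mark it as fragmentary, and that runs of dashes mark missing text. The function names are made up.

```python
import re

DASH_RUN = re.compile(r"^[–-]+$")  # a token made only of dashes: a gap marker
BROKEN = re.compile(r"[\[\]…]")    # a bracket or ellipsis inside a token


def join_hyphenated(text):
    """Rejoin a word split by a hyphen at a line break."""
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


def complete_words(text):
    """Keep only tokens that are whole words, skipping fragments and gap markers."""
    words = []
    for token in join_hyphenated(text).split():
        if DASH_RUN.match(token) or BROKEN.search(token):
            continue
        words.append(token)
    return words


sample = "lemma-\ntisation …]atisatio[ – – ]atisatio[ – – kai"
print(complete_words(sample))
```

Without the `join_hyphenated` step the same input would yield two spurious tokens in place of the one hyphenated word, which is the overcounting described above.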

The next three problem fixes seem to be what’s caused issues. Papyri are spelled phonetically, by the norms of Koine Greek, so the lemmatiser allows for some spelling variation: ι for ει, for instance, or ω for ο. Inscriptions and legal deeds from Late Byzantium need to allow for a lot more spelling variation, because of the many Ancient phonemes that had ended up pronounced identically: so ι could now be a misspelling of any of η ει οι υ υι.

Archaic inscriptions, on the other hand, may have a narrower range of respellings than papyri (depends on how early), but they also have different spellings of their own, because they use different versions of the Greek alphabet: ω and ου were Ionic innovations in the alphabet, for example, and what conventional Ancient orthography spells as ω and ου, most inscriptions before iv BC spell as just ο. So unlike papyri or church deeds, a system dealing with inscriptions has to allow ο to stand for ω or ου.

The lemmatisation run over PHI #7 that I’d reported was allowing all possible respellings from all periods indiscriminately. So an XIV AD document was being allowed the same latitude in spelling as a vii BC document.

Yeah, you can see how that might be a problem. I fixed this by allowing different respelling rules for the three parts of the corpus: the ancient inscriptions, the papyri, and the Christian inscriptions (which run all the way to Ottoman times). There’ll still be some wrong respellings, because each part of the corpus spans a long period. But it’ll be a lot better than allowing XIV AD iotacism in a vii BC text. Of course, restricting respellings means that lemmata that were being over-recognised in texts now aren’t. That’s fair enough.
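Splitting the respelling rules by part of the corpus can be pictured as one rule table per part. The sketch below is illustrative only: the rules are just the handful mentioned above, the forms are written without accents to keep the toy simple, and `RESPELLINGS`, `LEXICON` and the function names are hypothetical.

```python
from itertools import product

# written letter -> spellings it may stand for, per part of the corpus
RESPELLINGS = {
    "ancient_inscriptions": {"ο": ["ω", "ου"]},
    "papyri": {"ι": ["ει"], "ω": ["ο"]},
    "christian_inscriptions": {"ι": ["η", "ει", "οι", "υ", "υι"]},
}

# Toy lexicon of conventionally spelled forms, accents omitted.
LEXICON = {"λογου", "ειμι", "οικος"}


def candidates(written, rules):
    """All conventional spellings the written form could stand for."""
    options = [[ch] + rules.get(ch, []) for ch in written]
    return {"".join(choice) for choice in product(*options)}


def recognise(written, part):
    """Lexicon entries reachable from `written` under the rules for `part`."""
    return sorted(candidates(written, RESPELLINGS[part]) & LEXICON)


print(recognise("λογο", "ancient_inscriptions"))
print(recognise("ικος", "christian_inscriptions"))
print(recognise("ικος", "papyri"))
```

The last call comes back empty: the late iotacist reading of ι as οι is allowed for the Christian inscriptions but not for the papyri, which is the kind of over-recognition the restriction removes.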

I also tried to restrict the lemmata that were allowed for each part of the corpus, to prevent absurdities. Modern Greek words couldn’t be allowed for Ancient texts of course, but they do show up in the late Christian inscriptions. The ancient inscriptions do keep going well into Roman times, so I couldn’t ban Koine lemmata from there; but I did try to keep recognition plausible, by blocking from the papyri and ancient inscriptions any words unique to Trapp’s dictionary.

That’s underestimating both Trapp and the papyri. The papyri keep going until Greek yielded to Arabic in Egypt—a generation or so after the Islamic conquest, so VIII AD. Trapp, OTOH, badges itself as IX-XII AD—but it also sets out to fill in gaps left by other dictionaries, so it can be the only place where late papyri get covered. So some lemmata that should have been allowed for the papyri were being blocked. But having checked, only 150-odd legitimate lemmata were affected (and are now back in). So that wasn’t the major disruption.

The other problem, as far as I can tell, was that PHI #7 allowed in its markup both the word or phrasing the editor thinks the text is saying, and (in special brackets) the odd wording the scribe actually wrote; e.g. lemmatisation {4lmmeatsiantion}4. If an editor has decided to correct lmmeatsiantion as lemmatisation, I decided, I shouldn’t be trying to analyse both. The editor’s fix should count as the word instance for recognition: the “misspelling” (as the editor has judged it) shouldn’t be considered an independent word. It looks like, in the process, some words LSJ says existed no longer turn up, because LSJ didn’t trust the editor as much as I do. But all texts from a papyrus or inscription get filtered by the editor publishing it and making sense of it, just like all the literary texts in the TLG. So that’s the consistent thing to do.
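Dropping the scribal form and keeping the editor’s reading amounts to a simple filter. The sketch below handles only the `{4…}4` bracket pattern shown in the example above; any other bracket conventions in the PHI #7 markup are not covered, and the function names are invented.

```python
import re

# The scribe's original wording, as in the example: {4lmmeatsiantion}4
SCRIBAL = re.compile(r"\{4([^{}]*)\}4")


def editor_text(marked_up):
    """Remove scribal originals, leaving only the editor's reading."""
    without = SCRIBAL.sub(" ", marked_up)
    return " ".join(without.split())


def scribal_forms(marked_up):
    """List the scribal originals, in case they are revisited later."""
    return SCRIBAL.findall(marked_up)


line = "lemmatisation {4lmmeatsiantion}4 counts"
print(editor_text(line))
print(scribal_forms(line))
```

Keeping `scribal_forms` separate means the rejected spellings are set aside rather than thrown away.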

All up, skipping proper names, 3,500-odd lemmata are no longer turning up as recognised. OTOH, 700 lemmata are now newly turning up that weren’t before. Those numbers are still subject to change; but most of the 3,500 lemmata that disappeared should have disappeared. The scribal originals like lmmeatsiantion arguably shouldn’t have disappeared, and I may end up revisiting them down the road. But I’ve already spent two weeks trying to deal with the vanished 3,500, and I shouldn’t be holding postings up much longer.

To compensate for the missing 3,500, I went through the PHI #7 corpus, and looked more closely at what kinds of words weren’t being recognised—making sure that words occurring frequently in the corpus were accounted for. That involved some tweaking in the allowable spelling variations, and some filling in of the more obscure dialects’ grammar.

I had no idea what the Arcadian first declension genitive was like—or how it’s spread. Arcadian τρίταυ /trítau/ “of the third” corresponds to Homeric masculine τρίταο /trítao/ (Attic τρίτου /trítoː/), but it’s also spread to the feminine, displacing Proto-Greek and Doric τρίτας /trítaːs/ (Attic τρίτης /trítɛːs/). Arcadian τρίταυ reminds me of the Esperanto -aŭ ending; I wonder if I’m the first person to have had that mental short-circuit.

Beyond that, if the dictionaries that the TLG lemmatiser already knew about didn’t account for frequent word forms, I checked it in DGE. After all, part of DGE’s reason for existence was to broaden the coverage of LSJ into new finds in inscriptions and papyri. For lowercase words (i.e. excluding proper names), I went through all word forms occurring more than twice in the corpus; DGE is up to εκ-, and I did end up adding new lemmata from DGE, unique to this corpus.

The count of lemmata I added to the vocabulary from DGE… was 12. This surprised me, especially because even between α and αλ (for which DGE went back and redid Vol. I) there were a few word forms still unaccounted for on PHI #7. Going down to word forms occurring just twice or once will account for a lot more than 12 lemmata from DGE; but it won’t account for thousands. The remaining gaps even after DGE are something I’ll be looking at again: I’m curious to work out what’s going on. Of course, PHI #7 is nowhere near a complete corpus even for 1995 when it was published, let alone now, with the continuous stream of inscriptions and papyri being transcribed and published. Only the Athenian curse tablets from Audollent’s 1904 collection, for example, are in. (So when I looked at how καταχθόνιος and χθόνιος are used in the tablets for a paper, I had to do eyeballing as well as keyboard searching.)

I also wanted to improve the recognition of proper names particular to PHI #7, where the lemmatiser is really struggling: it now recognises 46% of all capitalised words, vs. 89% of all lowercase words. As I keep saying, proper names shouldn’t count at all, but a couple of thousand instances of Πεθέως drawing a blank from the lemmatiser was a bit much for me. Moreover, if the lemmatiser isn’t told about a proper name, it will end up making wrong guesses about what the lemma actually is. There are several inscriptions-only names that I was able to find in Pape-Benseler; but the big store of unrecognised names is in the papyri. And there’s a simple reason why so many names from papyri drew a blank from the Greek lemmatiser: they’re not Greek names, but Egyptian.

Of course, adding 500 or 1000 Egyptian names to improve Greek word recognition sounds suspect, right? But no more suspect than adding Hebrew names to improve recognition of words in the Septuagint, or Roman names to improve recognition of Cassius Dio. That, after all, is why proper names don’t count when you count lemmata.

I’m using Foraboschi as my Egyptian phone book; it’s the update to Preisigke’s Namenbuch, which seems to be AWOL at the moment in transit from Monash University to the University of Melbourne. My bloody fault for not waiting to drive over to Monash on the weekend—it’s just 10 minutes up the road from my place.

People whose day job it is to look at names in papyri (several projects based at Leuven) have already been counting the proper names in the Duke Database of Documentary Papyri, which is what PHI #7 uses for the papyri. So they’re doing the electronic counterpart to the dead tree phone book I’m sampling. The Leuven projects have come up with 26,000 name variants in the corpus, in 16,500 lemmata—and the majority of them are Egyptian, and unknown to other corpora of Greek (although a few of them make it to Athanasius of Alexandria or the Desert Fathers, who after all were also Egyptians). I’m not proposing to sit down and add 16,500 lemmata to the lemmatiser database: this is not my day job. I’m aiming at adding around 1,000, as triage prioritising the most frequent names; that’ll account for uppercase word forms turning up 10 or more times in the corpus.

So, I’m going to tell you I know of 1,000 Egyptian names in PHI #7, when the Leuven papyrologists know there are 16,500? Why yes. Just like I’m telling you I know of 35,000 proper names in the TLG, when there are 42,000 uppercase words in the TLG unaccounted for. I don’t know how many names there are, and that’s not what the current lemma count is about. But I do know how many names account for n% of the corpus, for a suitably large n. The name count is not open-ended, but it is pretty large—larger than for common nouns.
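The triage amounts to a frequency cut-off, and a toy sketch shows the shape of it. Everything here is hypothetical: the counts are invented for illustration (only the order of magnitude for Πεθέως echoes the text), and `pick_frequent_names` is not a function of the TLG lemmatiser.

```python
from collections import Counter

def pick_frequent_names(form_counts, min_count=10):
    """Return the forms at or above min_count, and the share of tokens they cover."""
    total = sum(form_counts.values())
    chosen = [form for form, n in form_counts.most_common() if n >= min_count]
    covered = sum(form_counts[form] for form in chosen)
    return chosen, covered / total

# Toy counts of unrecognised capitalised forms, invented for illustration only.
toy_counts = Counter({"Πεθέως": 2000, "name_b": 40, "name_c": 10, "name_d": 3, "name_e": 1})

chosen, share = pick_frequent_names(toy_counts)
print(len(chosen), "names cover", f"{share:.1%}", "of the toy tokens")
```

The point of the sketch is only that a short list of frequent names can cover most of the tokens, while the long tail of rare names stays unlisted.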

In fact, the good folk at Heidelberg Uni Centre for Research on Antiquity have produced lists of lemmata in papyri. They’ve got around 22,000 lemmata, half of them names. So Leuven knows more names than Heidelberg knows—presumably because Heidelberg is using a smaller corpus. I know less than either. And the more papyri turn up, the more names and nouns and verbs will turn up. Lemmata are open-ended.

But while Heidelberg’s 11,000 names can turn into 16,500, they won’t turn into a million. And while 175,000 lemmata without names can turn into 173,000 when I fix PHI #7—and maybe 220,000 once both the dictionaries and the corpus are complete up to 1453 or 1669—it’s not going to turn into five million. Even if you count all the variants of dialect and spelling and phonology in lemmata, as I’ll attempt in the final installment—which is how Leuven get from 16,500 names to 26,000, and the OED gets from 230,000 lemmata to 610,000 variants: even then, you’re not getting to a million. (My current back-of-the-envelope calculation without names is around 350,000.)

OK, I’ve got some Egyptian names to go before I revise the published counts.

Lerna VIa: For Zeus’ Sake, How Many Words?

By: Nick Nicholas | Post date: 2009-06-18 | Comments: 3 Comments
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Byzantine Greek, Lerna, lexicography, lexicon, TLG

[Counts in this post have been corrected in Lerna VId]

At long last, after nine posts of teasing, will I finally give the punters a count of lemmata of Greek?

Why yes. Yes I will. And then for a change, I will also set to work inflating it, to extrapolate from the current corpus and lexicon I have access to, to how much larger it could conceivably get.

Ready for it? The count of lemmata, known to date to the TLG lemmatiser, and recognised in the four corpora we’ve set up as they stand to date, is…

	Lemmata	Excluding Greek Numerals	Excluding Proper Names
TLG + PHI #7	216,234	211,794	175,791
TLG (viii–XVI)	201,680	197,448	162,219
Mostly Pagan (viii–IV)	99,426	98,593	76,145
Strictly Ancient (viii–iv)	66,437	66,078	55,003

1. Lemma counts

Did I have to quibble even here? Why, of course I did. The lemmatiser makes sense of Milesian numerals like αϠοα = 1971 and χξϛ = 666, but including them in the vocabulary of Greek as lemmata is a bit much. And dictionaries do not include proper names. So if you’re going to compare the headword count in LSJ with the headword count in OED, you won’t be including proper names in your count. Proper names are not exactly open-ended in count, but they do work differently from core vocabulary, they come from a lot more sources and cultures, and knowing lots of people doesn’t really prove your vocabulary is more expressive.

So if we’re comparing the TLG corpus to the OED’s, we’ll say 162,000. OTOH, adding proper names is most of my fun with the TLG lemmatiser: Athenian courtesans one minute, Albanian chieftains the next. In terms of the lemmatised search engine, they are search targets like any other. So if we’re not comparing the TLG corpus to any dictionaries, we’ll say 202,000 lemmata.

Not So Fast!

Only if you’d asked me two years ago, I’d have told you 231,000. And that was when I was recognising 90% of all word forms—as opposed to now, when I’m recognising close to 94%. Does that mean the Greek language has lost 25,000 lemmata in the intervening two years, even as the lemmatiser now recognises 60,000 more word forms? No. The lemmatiser has just gotten more discerning about when it claims a new lemma has shown up.

There is a lot of ambiguity in dealing with three thousand years and six dialects of Greek, and incomplete dictionaries. The lemmatiser has been allowed to make up its own lemmata (more on this below); it does this to cover gaps in dictionaries, whether they’ve gotten to pi or omega. But in the past two years, the lemmatiser has been constrained to make up a new lemma only if it doesn’t have a “legitimate” alternative, already recorded in a dictionary.

The lemmatiser has also gotten better at conflating variant stems under the one lemma. That’s a huge issue, which will have to wait for a couple of posts: the number of stems I count as distinct lemmata is in several ways different to the number LSJ counts as being distinct lemmata. There are 216,000 lemmata in the overall corpus, following the lexical database’s definition of when two stems belong to different lemmata. But its definition will be different to someone else’s definition. As we’ll see, it’s not always clear from the dictionaries when they consider two lemmata distinct, or whether they should when they do.

All this should make you distrustful of any lemma counts I give you. As well it should. Counting lemmata is an artefact of how you set about counting lemmata; different criteria, and different methods of analysis, will give different results. As will different sizes of corpora. So in this and the next couple of posts I’m going to stretch the lemma count, then shrink it, then stretch it again. (And then I’ll bring this thread to an end; there’s Mariupolitan dialect waiting to be blogged about.)

Fuzzy Boundaries

To begin stretching, I’m going to allow for the uncertainty of lemmatisation. The TLG lemmatiser, confronted with much too much ambiguity, ranks potential analyses of word forms as belonging to different lemmata. The counts I’ve just given are for the “winning” lemmata for each word form. If I include the “also-rans” in the analyses, then I’ll also be counting lemmata which never give the preferred analysis for any word form—but which the lemmatiser keeps in reserve, in case one of them turns out to be correct after all. You will get search results if you look up those lemmata in the TLG. You will also get lots of warnings, saying “but this form is probably X instead, and that form is probably Y instead.”
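To make the difference between the two kinds of count concrete, here is a toy sketch. It assumes each word form comes with a ranked list of candidate lemmata, best first; the data and the function names are hypothetical, and the real lemmatiser’s ranking is not shown here.

```python
def count_winning_lemmata(analyses):
    """Count distinct lemmata that are the top-ranked analysis of at least one form."""
    return len({ranked[0] for ranked in analyses.values() if ranked})

def count_all_lemmata(analyses):
    """Count distinct lemmata that appear anywhere in the ranked analyses."""
    return len({lemma for ranked in analyses.values() for lemma in ranked})

# Hypothetical ranked analyses: word form -> candidate lemmata, best first.
toy_analyses = {
    "form1": ["lemma_a", "lemma_b"],
    "form2": ["lemma_a"],
    "form3": ["lemma_c", "lemma_b"],
}

print(count_winning_lemmata(toy_analyses))  # 2
print(count_all_lemmata(toy_analyses))      # 3
```

In the toy data `lemma_b` never wins, so it only shows up in the second count: that is what an also-ran is.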

If I include these also-rans in the counts, the counts go up to:

	Lemmata
TLG + PHI #7	220,560
TLG (viii–XVI)	206,161
Mostly Pagan (viii–IV)	107,257
Strictly Ancient (viii–iv)	73,427

2. Lemma counts, including also-ran analyses

This gives a curious result. There are lemmata which the Strictly Ancient corpus rejects as implausible for all its forms; but somewhere in the subsequent mediaeval morass, a word form turns up for which the rejected lemma makes the most sense after all. Of course, a lot of those lemmata rejected for Strictly Ancient Greek will turn out to be Byzantine after all. So it makes sense they’d become more acceptable, once bona fide Byzantine texts are included.

More dictionaries

There are two constraints on what a lemmatiser recognises: how many words it knows about, and how many words you’re asking it to recognise. Increase the corpus—as we did by excluding and including Byzantine texts—and it will find more lemmata. Give it more dictionaries—as the TLG did by adding Lampe, Trapp, and Kriaras—and it will also recognise more lemmata. So these counts could be bigger with more texts (and more texts are being added), and more dictionaries.

You can increase your dictionary size by allowing the lemmatiser to do what human beings do: make words up. You can make words up from whole cloth, as a random but plausible sequence of sounds. But that’s fairly rare in human language. What is much more usual is making up words based on existing words, using rules present in the language (derivational morphology). The lemmatiser does know something about Greek derivational morphology: as a result, the TLG counts include some 15,000 lemmata that are not in its dictionaries, but are derived from lemmata that are. Two thirds of these are from prefixing prepositions to verbs, which is quite productive in Greek. One third is from derivational morphology forming new stems through suffixes, and those proposed analyses are more tentative.
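As a very rough sketch of the prefixing case: assume a tiny list of prepositions and a tiny set of known verbs, and accept a candidate lemma if it is a preposition glued to a known verb. The lists and the function name are hypothetical, and the sketch deliberately ignores elision, assimilation and accent shifts, all of which a real treatment of Greek compounds has to handle.

```python
import unicodedata

def nfc(text):
    """Normalise to NFC so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text)

PREPOSITIONS = [nfc(p) for p in ("κατα", "περι", "προ")]   # illustrative subset
KNOWN_VERBS = {nfc(v) for v in ("γράφω", "βάλλω")}         # stand-in for dictionary lemmata

def guess_compound_verb(candidate):
    """Return (preposition, base verb) if candidate = preposition + known verb, else None."""
    candidate = nfc(candidate)
    for prep in PREPOSITIONS:
        base = candidate[len(prep):]
        if candidate.startswith(prep) and base in KNOWN_VERBS:
            return prep, base
    return None

print(guess_compound_verb("καταγράφω"))
print(guess_compound_verb("πιθανοθηρία"))
```

The first call finds a preposition plus a known verb; the second returns `None`, because nothing in the toy lexicon explains the word.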

But derivational morphology will only catch some words. Otherwise, if a lemma is unrecorded in the dictionaries that the lemmatiser has access to, then that lemma won’t show up in the list of lemmata recognised: you have to tell the lemmatiser that πιθανοθηρία exists, for it to make sense of πιθανοθηρίας. If you give the lemmatiser access to more dictionaries, it will recognise more words.

Much of the TLG is Byzantine, and more of the TLG is going to be Early Modern; so the fact that the dictionaries of both stages of the language are currently stuck at pi means there are future volumes of those dictionaries that the lemmatiser hasn’t been given access to yet (because they don’t exist). That means a lot of lemmata in the corpus going unrecognised, and being missed in these counts.

My back of an envelope can beat your back of an envelope

How many? Here start the back of the envelope calculations: take them with several satchels of salt. Trapp has finished six volumes out of a projected eight, and adding Trapp’s lemmata to the lemmatiser has accounted for 25,000 lemmata newly recognised in the TLG. The remaining two volumes (to be completed by 2013) should add another 8,000 lemmata to the TLG. So 202,000 will go to 210,000—give or take a couple of thousand. Remember those 15,000 lemmata the lemmatiser is recognising, even though they’re in no dictionaries? Some of them will turn up in the forthcoming volumes of Trapp.

The same holds for Kriaras. With the 2.5 million words of Early Modern Greek in the TLG, adding Kriaras’ dictionary to the lemmatiser accounts for 2650 lemmata. Since Kriaras is up to 15 volumes of a projected 19, that would mean we’re owed another 550 lemmata. Except, the TLG is not going to stay at 2.5 million words of Early Modern Greek: it’s expanding deliberately into the Early Modern corpus, and there’ll be a lot more lemmata added to the TLG as it does so. The 15 volumes of Kriaras have added 9900 lemmata to the lexical database, so with the 2650 already seen in the corpus, we can expect another 9900 lemmata once the entire Early Modern corpus is in and Kriaras is completed. That takes us to 220,000 lemmata.

In fact, because we haven’t run out of Byzantine learnèd texts to data enter into the TLG corpus either, 8,000 more lemmata from Trapp is an underestimate. There are 35,000 headwords in Trapp that the TLG does not already recognise from its other dictionaries; but the current corpus only accounts for 25,000 of them. There’s some spelling variation and derivational morphology clouding the results, but all up, and assuming headwords and lemmata are the same (which they’re not), we should expect not 8,000 more lemmata once all the texts are in (and all the spelling variation accommodated for), but 20,000. That takes us to 232,000 lemmata.
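One way to redo this envelope, as a sketch only: assume every dictionary volume contributes lemmata at the same rate, so the volumes still to come scale with the volumes in hand. That proportionality is an assumption of the sketch, and the variable names are made up; the input figures are the ones quoted above.

```python
def remaining_share(done, planned):
    """Volumes still to come, as a fraction of the volumes already published."""
    return (planned - done) / done

tlg_now = 202_000

# Trapp, simple version: 25,000 lemmata recognised from 6 of 8 volumes.
trapp_simple = 25_000 * remaining_share(6, 8)

# Kriaras: 9,900 lemmata in the database from 15 of 19 volumes, 2,650 seen so far.
kriaras_unseen = 9_900 - 2_650
kriaras_future = 9_900 * remaining_share(15, 19)
kriaras_more = kriaras_unseen + kriaras_future

# Trapp, fuller version: 35,000 new headwords, 25,000 of them seen so far.
trapp_unseen = 35_000 - 25_000
trapp_future = 35_000 * remaining_share(6, 8)
trapp_full = trapp_unseen + trapp_future

print(round(trapp_simple), round(kriaras_more), round(trapp_full))

# Totals using the rounded figures from the text (8,000 / 9,900 / 20,000).
total_simple = tlg_now + 8_000 + 9_900
total_full = tlg_now + 20_000 + 9_900
print(total_simple, total_full)
```

On this reading the raw proportional estimates come out near 8,300, 9,900 and 21,700, which sit close to the rounded 8,000, 9,900 and 20,000 used in the text; the two totals are 219,900 and 231,900, i.e. the 220,000 and 232,000 quoted above.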

“A headword is not a lemma”?! But the definition of “lemma” *is* “headword”! I’m being a little idiosyncratic in my usage: the source dictionaries each have their own headwords, but I’m calling “lemma” the canonical lexeme that the lemmatiser uses, to bind those headwords and variants together—in case it merges two different headwords from its sources into the one lexeme. In other words, I distinguish dictionaries’ print “headwords” from the lemmatiser database’s “lemmata”.

Even with Strictly Ancient Literary Greek, there are word forms that the dictionaries are missing, because editions have changed, or new fragments have turned up, or lexicographers (who often copied each other) had a blind spot. The DGE do use the TLG to compile their list of words, so there’s a good chance they’ll catch the gaps in the TLG corpus at least. But the corpora are fluid, for reasons we’ve already discussed. So πιθανοθηρία “hunting for possibilities” turns up as a variant reading in Plato’s Sophist; because it’s a variant reading, lexicographers have not been panicked to include it. The dictionaries don’t (yet) register anything that deals with ὀκταχοίνικον “eight choinixes heavy” in Aristophanes, or πεντηκοντακισμυρίους “five hundred thousand” in Polybius. (That’s the other blind spot of lexicographers: numbers are boring.)

The count of missing lemmata won’t be massive for antiquity: there are currently just 1,600 word forms unrecognised in the Strictly Ancient corpus, and from inspection, skipping proper names and geometric lines, there’ll be a lot less than 500 lemmata’s worth there. Still, 500 is 500; and it’s thousands, not hundreds, of skipped lemmata moving forwards from Strictly Ancient to Mostly Pagan. Moreover, there are a couple of late ancient texts that, once added, will give us a lot of lemmata: we’re owed 1000 lemmata from the old Latin-Greek dictionaries, a couple of hundred from the Hexapla.

PHI #7 has even more treasures to unlock—although it’ll have to be someone else’s job to do the unlocking. Remember that Vol I of the DGE in the second edition has 3500 lemmata not in LSJ, from α to ἀλλά; extrapolating, that would mean around 80,000 lemmata all up not in LSJ, if and when DGE finishes. How many of those lemmata are already going to be in Bauer, Lampe, and Trapp? My guess is, a fair few; DGE is nowhere near as Pagan-centric as LSJ.

But there are also lemmata specific to the inscriptions and papyri, and not recorded elsewhere. The lemmatiser failed to deal with 40% of the 300,000 word forms unique to PHI #7; and that corpus is not the latest and greatest in papyrology and epigraphy. With 6 word forms per lemma in the Strictly Ancient corpus (which has about the same number of forms), that could mean 20,000 more lemmata unaccounted for. I’m sure that maths is completely wrong, but let’s pretend it’s not: that would mean that, if Kriaras, Trapp, and DGE were completed and added to the TLG lemmatiser, the overall lemma count for PHI #7 and TLG would go from 216,000 to 268,000. Give or take several thousand and more and more satchels of salt.
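The 20,000 is nothing more than the following arithmetic, sketched here with the figures from the paragraph above (the variable names are mine, and the six-forms-per-lemma ratio is borrowed from the Strictly Ancient corpus, as stated):

```python
unique_forms = 300_000        # word forms unique to PHI #7
unrecognised_percent = 40     # share the lemmatiser failed to deal with
forms_per_lemma = 6           # ratio observed in the Strictly Ancient corpus

missing_forms = unique_forms * unrecognised_percent // 100
missing_lemmata = missing_forms // forms_per_lemma

print(missing_forms, missing_lemmata)  # 120000 20000
```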

Keep going? The Triantaphyllidis dictionary of Modern Greek (the contemporary language) has around 47,000 headwords. There’ll be substantial overlap with LSJ, let alone with Kriaras. Maybe 10,000 more? Maybe 15,000? And there’s the dialects of Modern Greek to count too. It’s starting to look like all the lemmata of any form of Greek ever spoken anywhere will be around 300,000.

And is that count meaningful? Of course not. Modern Greek is borrowing words from English all the time. (Some of them, it even spells with Greek characters.) And if I did a lemma count for every Italic language spoken over the same time span in the general area of the Apennine Peninsula—from Iron Age South Picene through to modern Bulgnais, via Latin, Standard Italian, and a lot more than six literary dialects in between—then I’d probably get more than 300,000. And of course, any lemma count spanning more than a century is not linguistically kosher anyway. But hey, people ask.

(Bulgnais? The dialect of Bologna. Sounds like Provençal with a head cold. See description in Bulgnais.)

Back to LSJ

You may have noticed that I was only half-heartedly using counts of dictionary headwords in all of this, even though dictionary writers have a bigger corpus than I do, and a more authoritative sense of what should count as a headword. Mapping headwords to lemmata is more complicated than you might think, especially if you’re spanning millennia.

For instance, are ἀνήρ and άντρας the same lemma? We’ve decided they are. But the headwords look completely different, because they are two thousand years apart. And this doesn’t just happen going from Ancient to Modern Greek; it happens from Ancient to Byzantine Greek, and even between Ancient dialects. So there will be less lemmata than headwords. On the flip side, some variants in the printed dictionaries are being treated as separate lemmata, because there’s no consistent indication when they’re variants and when they’re more loosely related forms.

For the record, I’ll note that the 1940 LSJ has 122,000 headwords by my reckoning, and 6,000 of them are cross-references; that leaves 116,000 headwords. The LSJ supplement of 1996 has an additional 10,000 headwords, which makes it 126,000. These headwords span the Mostly Pagan corpus (75,000 lemmata that aren’t names or numbers), plus the inscriptions and papyri (at least 10,000 more lemmata), plus a bunch of technical vocabulary that I’d skipped in excluding Galen and his fellows. If I put Galen and his fellow technical writers back in, but I still leave out the Church Fathers (like LSJ does), and I also add in the papyri and inscriptions from PHI #7 (but not the “Christian Empire” inscriptions), then I get a corpus pretty much like LSJ’s corpus. And my lemma count for that corpus, leaving out numbers and names, is 98,333.

98,333 isn’t 126,000; but then, lemmata aren’t the same as headwords, some LSJ lemmata have disappeared with new editions, we’re missing a few late ancient texts, and the lemmatiser is really struggling with understanding inscriptions. In addition, the lexica of Hesychius and Photius alone, which are outside the Mostly Pagan corpus (but document older stages of Greek) account for close to 9000 headwords in LSJ, and well over 1000 for the LSJ Supplement. With them taken out, 98,333 is certainly in the same neighbourhood as 116,000.

I’ll have more to say about headword-to-lemma mapping in a couple of posts, anyway.

For Zeus’ Sake, How many lemmata of Greek?

So. With the evidence available to me right now, with the current status of the TLG corpus and lemmatiser, and my satchels of salt, and OK, OK, enough already…

How Many Ever? Depending on how long a time window you put, and whether names count or not, anything from 55,000 to 300,000.
How many lemmata of Ancient Greek? Depends again, but if names don’t count, and we allow Synesius and not St Athanasius—i.e. LSJ’s Pagan-centric definition of Ancient Greek—then 98,000 is a reasonable first guess. Though the way DGE is going, and with new material showing up in the sands of Egypt, add maybe 20,000 lemmata to that.
How many lemmata of Literary Ancient Greek? Depends for a third time, but if time didn’t stop with Aristotle but Synesius (and names still don’t count), 75,000. If time did stop with Aristotle, 55,000.
If names count, Ancient Greek goes up to 124,000. Literary Ancient Greek goes up to 99,000. Homer-to-Aristotle Greek goes up to 66,000.

No, that’s not a straight answer. You want a straight answer, don’t do lexicography. And in the next couple of posts, I’ll complicate the answer more again.

Lerna Vb: Forms of Good Pedigree

By: Nick Nicholas | Post date: 2009-06-15 | Comments: No Comments
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Lerna, morphology, TLG

[Counts in this post have been corrected in Lerna VIc]

In the last post, we did some pruning of the word form count of our corpora, and came up with some numbers. We also noted that, once you pruned away the 137 forms of ἀνήρ, you’re still left with 42 forms of ἀνήρ.

(Did I say 43? I miscounted. Dangerous thing to admit, with all these numbers flying about. But you should be taking those numbers with a grain of salt anyway. As I’m going to keep saying.)

42 is a lot more than the 11 forms ἀνήρ should have, based solely on the Attic dialect. Here, we’re going to look at where the remaining 31 forms came from, and what that tells us about the morphological heterogeneity of the TLG corpus. We’re also going to keep pruning at those numbers we came up with last time, and see if we can arrive at something like a count of Good Reliable Attic word forms.

The Attic forms of ἀνήρ are shown today in glorious Galatia SIL:

The classicists among you will have picked up that a bunch of the remaining forms are Epic or “poetic”. Another 12 of them:

The tricky proto-Greek stem, *anr-, shows up in Epic with the variant stem a(ː)nér-:

	Sg	Du	Pl
Nom		ἀνέρε	ἀνέρες
Gen	ἀνέρος		ἀνέρων
Dat	ἀνέρι		ἀνέρεσι, ἀνέρεσσι, ἀνέρσι
Acc	ἀνέρα		ἀνέρας
Voc	ἆνερ

The multiple choices are typical of Epic: Epic is a conventional, mixed dialect, and it was handy for Epic to have multiple choices, to fit the metre that the dialect was used in. Hence the variation in the dative plural between -si(n), -esi(n), and -essi(n).

The lack of ἀνδρ- forms in the table, btw, doesn’t mean Epic literature avoided the ἀνδρ- stem. Homer used both. It just means that we’ve already checked off the ἀνδρ- forms for Attic. But because Epic inflections can also appear on the ἀνδρ- stem, the Epic count also includes a fourth dative plural, ἄνδρεσσι, which we did not count under Attic.

That leaves 19. We can pick off four forms of ἀνήρ as Modern Greek:

Of course, treating Ancient ἀνήρ, ἀνδρός as the same lemma as Modern άντρας, άντρα is a bit of a leap, and it shows the problem with having a single vocabulary try to span three thousand years: there is a continuum from ἀνέρος to ἀνδρός to ἄνδρα to άντρα, but the endpoints are far apart. Still, spanning even a century in a corpus raises problems, because language is a moving object. And on the flip side, much of Greek literature—including the Epics themselves—are attics full of relics. Much like any literary language, really, just over a longer timespan. So we’ll treat these as the same lemma (because the TLG has the one search engine for everything); but we’ll note that this is a difficult judgement to make in general—and that it has limited synchronic reality.

A further 8 forms look Epic (both ἀνερ- and ἄνδρ- stems), but are accented further back than they would be in Epic Greek. That should make them Aeolic:

We have very, very few literary texts actually in Aeolic. Five of these eight forms do actually turn up in what literary Aeolic we have: ἄνηρ (Alcaeus, Julia Balbilla), ἄνδρος (Sappho), ἄνερος (Theocritus), ἄνδρων (Theocritus, Alcaeus), ἄνδρεσι (Alcaeus).

Of the rest, ἄνερα shows up in fragments of Euripides and Numenius, and ἄνδρασι in fragments of Diocles and Phylarchus. Scribal errors? Maybe; at any rate, there’s nothing Aeolic about any of those authors.

The oddest of the eight is ἄνδρι. The form shows up in Jacoby’s collection of the Fragments of Greek Historians. This collection gathers up the bits of ancient historians who were not preserved in intact books, and it gathers them from wherever it can; lots of fragments come from citations in later authors. Jacoby has ἄνδρι in a passage by the historian Ion of Chios, as cited in Athenaeus. That means the passage in question turns up twice in the corpus: once in Jacoby’s edition of Ion, and once in Kaibel’s edition of Athenaeus. (That kind of duplication happens quite a lot in the TLG, though it involves small bits of text, so it does not inflate the word count all that much.)

The thing is, Kaibel’s edition of Athenaeus has the word as the normal ἀνδρί. Is this a typo in Jacoby? Is this an earlier version of the text of Athenaeus? Is this an emendation to Athenaeus by Jacoby, because he knows something about Ion that I don’t? I don’t know, and I’m not burning right now to find out. The point is that this kind of variability does happen in the corpus, and it does increase its morphological diversity more than it should.

So of the eight Aeolic forms, three don’t occur in Aeolic texts, and just look like misaccentuations. But this kind of misaccentuation turns out to be routine in Byzantine Greek: in fact, it accounts for most instances in the corpus of the first five “Aeolic” forms. This misaccentuation is too frequent a feature of Byzantine Greek to be an accident or scribal whim. It is a kind of systematic hypercorrection: “I’ll misaccent this word because it will sound more recherché.” So it’s not like Didymus the Blind or St Athanasius are aping Alcaeus specifically; they’re just randomising where the accent goes, as part of their game of Greek.

We know that they aren’t aping Alcaeus, because the Byzantines don’t only put the accent where the Aeolians would have put it; they also put it where no one would have put it. So Byzantine misaccentuation also accounts for four forms of ἀνήρ stressed on the final syllable:

This leaves us with three last forms of ἀνήρ.
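For anyone keeping score, the running tally so far can be sketched as a subtraction, using the group sizes given in the running text above (the labels are just shorthand for this sketch):

```python
total_forms = 42
accounted = [
    ("Attic", 11),
    ("Epic", 12),
    ("Modern Greek", 4),
    ("Aeolic-looking", 8),
    ("stressed on the final syllable", 4),
]

remaining = total_forms
for label, n in accounted:
    remaining -= n
    print(f"{label}: {n} forms, {remaining} left")
```

The last line printed shows 3 forms left, which are the three dealt with next.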

To get from Ancient ἀνήρ to modern άντρας, you need to switch from the third declension to the first declension, because the third declension was thrown out in Modern Greek as too hard. (There’s some Lazarus—or Zombie—third declensions in the contemporary language, but outside of -ις, -εως plurals, people are uncomfortable with them.) This means that the Ancient nominative ἀνήρ became the Byzantine nominative ἄνδρας. That nominative does turn up in the corpus, but it’s spelled identically to the Attic accusative plural ἄνδρας, so we’ve already crossed it off our list. It also means, though, that there is an accusative singular ἄνδραν in the corpus, which soon became Modern άντρα. So that’s where that form has come from.

The second form is ἄνδραις. This is a dative plural, turning up in one church hymn, of that Byzantine first declension variant ἄνδρας. It is also an old-fashioned spelling of the Demotic accusative plural (which would now be spelled άντρες), for reasons of morphological analogy denialism that I’m not going to get into here.

The last form is ἀνρός, and it’s not any form of Greek at all. It’s proto-Greek: it’s Herodian, reconstructing (correctly) what the original genitive of ἀνήρ must have been:

τὸ δὲ ἀνδρός κατὰ συγκοπὴν γενόμενον ἐκ τοῦ ἀνέρος ἐξ ἀνάγκης ἐπλεόνασε τὸ δ. οὐκ ἠδύνατο γὰρ εἶναι ἀνρός χωρὶς τοῦ δ, ἐπεὶ τὸ ν πρὸ τοῦ ρ οὔτε ἐν συλλήψει δύναται εἶναι οὔτε ἐν διαστάσει.
When andrós was formed by syncope [deleting a phoneme] from anéros (anéros > *anrós > andrós), /d/ was a necessary redundancy. For /n/ cannot directly precede /r/, either within a single word or between words. (Herodian De Prosodia Catholica p. 406 Lentz)

In fact, we’d say the proto-Greek was *anrós to begin with; but given the poor track record of Greek etymologists in general, Herodian gets cut plenty of slack from me.

We’ve just accounted for the 42 “legitimate” forms of ἀνήρ, and we can see some problems with the range of forms we’ve found:

11 are in one dialect.
12 are in a different dialect—albeit the literary dialect that almost all Classical literature draws on.
5 are in a third, marginal dialect
3 look like they’re in the third, marginal dialect, but are really just Byzantines making accents up. As are most instances of the previous 5.
3 are also Byzantines making accents up, in the opposite direction.
4 are in Modern Greek—and you can argue about the extent to which it is the same language at all.
2 are Almost-Modern Greek
1 is a hypothetical reconstruction of proto-Greek by a Roman-era grammarian (and the Byzantines that copied him).

All of these forms are Greek, in one way or another. But counting all of proto-Greek *ἀνρός, Modern άντρα, Poetic Aeolic ἄνερος and Attic ἀνδρός as genitives of “man” should make you nervous. These are not all part of the same linguistic system. We can concede Epic mixed with Attic, because everyone who wrote literature had Homer in mind; literary languages are not pure and uniform langues. (Spoken languages aren’t either—although a dialect with four different datives rightly makes people suspicious.) But listing ἀνρός and άντρα together… that’s weighing down the scales.

We whittled down the word form count in the previous post to something more reasonable—something that wasn’t lurching at every change in casing or apostrophe. But there are still oddball forms in the corpus, and it would be useful to filter out some of the more problematic word forms, to get a more accurate sense of what is going on in the language—to try to restrict the word form count to forms that might plausibly have been spoken by someone once. The lemmatiser can make some judgements about which forms are more oddball than others. It won’t be infallible—after all, it thought ἄνδρι was Aeolic. But it’s better than nothing, and it’s what I’ve got at hand.

We’ve seen above that some forms of ἀνήρ are perversely accented, and one form is a grammarian’s reconstruction. We can come up with a word form count which eliminates those four forms of ἀνήρ (though it will preserve the accidental Aeolic of the Byzantines). I’m going to filter out the following categories of word forms marked by the current lemmatiser:

Hypothetical forms (like *ἀνρός)
Hypercorrect forms (like ἀνδράς, or any number of other Byzantine hybrids and not-quite-genuine Doricisms)
Uncertain inflection categories (the lemmatiser has insufficient information on how a stem should be inflected)
The tense stem used to account for the verb form is not in the lemmatiser lexicon (so this could still be guesswork)
The inflection is anomalous (typically, it’s the “wrong” class of inflection by conventional norms—which covers lots of confused Byzantine optatives)
The form is a transliteration of Latin (occurs in Legal texts)

This should give us a count of Forms of Good Standing. There’s more grammatical eccentricities than that, but those are the most egregious.
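In outline, the filter is just a blacklist of markings. A minimal sketch, assuming each word form carries a list of flags; the flag names and the toy data are invented for this illustration and are not the lemmatiser’s actual labels:

```python
# Hypothetical flag names, one per category in the list above.
EXCLUDED_FLAGS = {
    "hypothetical",
    "hypercorrect",
    "uncertain_inflection",
    "unknown_tense_stem",
    "anomalous_inflection",
    "latin_transliteration",
}

def forms_of_good_standing(tagged_forms):
    """Keep the word forms that carry none of the excluded flags."""
    return [form for form, flags in tagged_forms if not (set(flags) & EXCLUDED_FLAGS)]

# Toy data: (word form, flags). The flags are illustrative guesses.
toy_forms = [
    ("ἀνδρός", []),
    ("ἀνρός", ["hypothetical"]),
    ("ἀνδράς", ["hypercorrect"]),
    ("ἄνερος", ["aeolic"]),
]

kept = forms_of_good_standing(toy_forms)
print(kept)
```

A dialect flag like the toy `aeolic` is not on the blacklist, so the accidental Aeolic of the Byzantines survives this cut.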

	Word Forms	Reduced
TLG + PHI #7	1,300,717	1,267,434
TLG (viii–XVI)	1,183,120	1,158,529
Mostly Pagan (viii–IV)	518,321	515,275
Strictly Ancient (viii–iv)	289,275	288,305

Not a massive cut, but a necessary one. Again, the Strictly Ancient corpus is better behaved overall, so there are less anomalies there that need culling.
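To put a size on “not massive”, the table above can be turned into percentages with a few lines (a sketch; the figures are copied from the table):

```python
cuts = {
    "TLG + PHI #7": (1_300_717, 1_267_434),
    "TLG (viii–XVI)": (1_183_120, 1_158_529),
    "Mostly Pagan (viii–IV)": (518_321, 515_275),
    "Strictly Ancient (viii–iv)": (289_275, 288_305),
}

for corpus, (before, reduced) in cuts.items():
    removed = before - reduced
    print(f"{corpus}: {removed:,} forms removed ({removed / before:.1%})")
```

On these figures the cut runs from about 0.3% of word forms for Strictly Ancient up to about 2.6% for the full TLG + PHI #7 corpus.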

The next cut will be more cruel. Stems and inflections are marked for dialect and period in the lemmatiser; again, not infallibly, but indicatively enough. There’s still a lot of Late forms in the corpus, including Graecobarbara. There’s also lots of forms from non-literary dialects, that weren’t lucky enough to have an Alcman or a Sappho.

We can filter out the Boeotian and Cretan and Locrian, and the Koine and Byzantine and Demotic, to give just forms compatible with literary Ancient Greek dialects. That’s an artificial barrier, sure, but no less artificial than including them all in the same corpus to begin with; and there are plenty of people, ancient and modern, who would look approvingly on this “no riff-raff” policy. It will make the corpus somewhat more morphologically consistent: we’ll at least be talking about five centuries’ worth of morphology, not twenty-five.

Limiting word forms to Forms of Good Standing And Pedigree does still include lots of word forms that were only devised in the fourteenth or fifteenth century, because the ancient corpus did not exhaust all the possibilities of the ancient language(s). Moreover, proper names haven’t been quarantined off from antiquity the way vocabulary proper has. So the count will be much more approximate and fuzzy than it seems. (That holds for all the counts here, of course; come back tomorrow, and the lemmatiser will give different counts.) Still, applying a No Riff-Raff constraint on the corpus—excluding post-Classical and dialectally marginal forms, and keeping just linguistically Classical forms, as the lemmatiser currently understands them—gives us:

	Word Forms	Reduced
TLG + PHI #7	1,267,434	1,135,915
TLG (viii–XVI)	1,158,529	1,041,520
Mostly Pagan (viii–IV)	515,275	505,302
Strictly Ancient (viii–iv)	288,305	285,856

The final cut is the cruellest of all, and it’s so cruel only a linguist would do it. Forms of Good Standing And Cecropian Pedigree; in other words, Naught but Attic. No Aeolic, no Doric, no Ionic, and—here’s the killer—no Epic. That cannibalises any classical literary work there is, and it’s an over-idealised notion of what was spoken in Athens: there would have been some Aristophanean peasants that spoke like that, but no educated Athenian would have. And of course in the other direction, Byzantines kept coming up with Attic-compatible words too. Still, this cruellest of all cuts will give us a word form count that describes just one dialect at a time. Subject to how much the lemmatiser knows about Greek dialect, and again, the lemmatiser is not infallible, and will never be complete.

But again, as long as you understand these numbers will be just indicative, and just illustrative, and are worth what you paid for: these are the Attic-only word form counts for the corpora:

	Word Forms	Reduced
TLG + PHI #7	1,135,915	1,020,232
TLG (viii–XVI)	1,041,520	952,993
Mostly Pagan (viii–IV)	505,302	458,933
Strictly Ancient (viii–iv)	285,856	248,914

Finally got the Strictly Ancient count to budge. 🙂

So we started with 1.3 million forms attested of Greek in our corpora; limiting them to forms compatible with Attic Greek takes that down to 1 million. For the Strictly Ancient corpus (when Attic was still a living dialect), that’s 250,000, down from 290,000.

Now, what did all that prove?

For two millennia after Attic was no longer a living language, it remained a language of literature. There are some 890,000 post-classical wordforms, but over two thirds of them are compatible with Attic, and only a tenth of them are linguistically Late. Now, in part, that’s simply because Greek did not turn into Lithuanian: there are plenty of words in Modern Greek that are compatible with Attic too—so long as you’re relying on historical orthography. But if the literary language reflected the vernacular more accurately, the count would be a lot less than two thirds. So writers kept using Attic morphology productively.
The word form count for the corpora includes a lot of problematic forms, particularly the later we get (and the more artificial the literary language becomes). These problematic forms are part of the heritage of literary Greek; but it is misleading to include them in evidence of the productivity of Greek morphology: many of them are fantasy morphology. That said, these problematic forms are not frequent as forms (2% of TLG forms), and are even less frequent as instances (0.7% of all the words in the TLG corpus).
Nonetheless, cutting the morphological variety of Greek down from three millennia to a couple of centuries of Attic does make a difference: 85% of all forms in the Strictly Ancient corpus are Attic, and 80% in the full TLG corpus.
… although in cultural, literary, and even sociolinguistic terms, limiting the morphology to Attic Only is an artificial thing to do. But that’s weighing the scales for you: there’s a reason why it happens.
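
As a sanity check, the shares quoted above can be recomputed from the tables in this post. This is only a sketch: the dictionary layout and function names are mine, and the post-classical figure assumes a plain subtraction of the Strictly Ancient counts from the TLG counts, which the post does not spell out.

```python
# Word form counts copied from the tables in this post.
# BASELINE: counts before the Good Standing, No Riff-Raff and Attic-only cuts.
BASELINE = {"TLG": 1_183_120, "Strictly Ancient": 289_275}
ATTIC_ONLY = {"TLG": 952_993, "Strictly Ancient": 248_914}


def attic_share(corpus):
    """Fraction of a corpus's word forms that survive the Attic-only cut."""
    return ATTIC_ONLY[corpus] / BASELINE[corpus]


def post_classical_attic_share():
    """Return (post-classical form count, fraction of those compatible with Attic).

    Assumption: post-classical = TLG minus Strictly Ancient.
    """
    total = BASELINE["TLG"] - BASELINE["Strictly Ancient"]
    attic = ATTIC_ONLY["TLG"] - ATTIC_ONLY["Strictly Ancient"]
    return total, attic / total


if __name__ == "__main__":
    for corpus in BASELINE:
        print(f"{corpus}: {attic_share(corpus):.1%} Attic-compatible")
    total, share = post_classical_attic_share()
    print(f"Post-classical: {total:,} forms, {share:.1%} Attic-compatible")
```

On these figures the Attic share comes out at roughly 86% for Strictly Ancient and 81% for the full TLG, and a bit under 80% of the roughly 890,000 post-classical forms, which squares with the rounded numbers in the text.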

So should we cite Classical Greek as having just 1 million, or 250,000 word forms, instead of 1.3 million or 1.8 million? Nah, we should not be counting word forms in a corpus at all, and limiting ourselves to accidents of attestation. But we should also be aware that any corpus like this is going to have forms that are more at home or less at home. And that it all depends.

One final note. The lemmatiser, as I keep saying, is changeable and fallible: the numbers I’ve been giving—and which I will give in later posts, once I start counting lemmata—are transitory, indicative, and unreliable: they only tell you how far one piece of software has gotten with one lexicon and one corpus. Because the TLG lemmatiser has to do a lot more than lemmatisers normally do—coping with six dialects and three thousand years—it runs into a lot more ambiguity than is usual; and it tries to deal with that ambiguity by ranking analyses of word forms as more or less plausible. If you’re using the TLG lemmatised search, you can view the word forms which the lemmatiser thinks *might* belong to the lemma you’re searching for, but probably don’t.

So if you search for ἀνήρ as a lemma, you’ll get the 42 forms we’ve been talking about. In fact, you’ll get 103 forms, because of all the variations in accent and crasis and apostrophes we’ve mentioned before—though the list is case-folded. But you can also access, by clicking Show lower confidence forms, word forms that the lemmatiser thinks might be, but probably aren’t, instances of ἀνήρ. As of this writing, that list includes:

κἄνδρος (1) (More probable lemma: Ἄνδρος)
ἄνδρου (56) (More probable lemmata: Ἄνδρος ἀνδρόω)
ἀνδρου (1) (More probable lemmata: Ἄνδρος ἀνδρόω)
ανδρου (1) (More probable lemmata: Ἄνδρος ἀνδρόω)
αντρον (1) (More probable lemma: ἄντρον)
ἄντρου (126) (More probable lemma: ἄντρον)
αντρου (1) (More probable lemma: ἄντρον)
ἄντρ’ (4) (More probable lemma: ἄντρον)
ἄνδρους (1) (More probable lemma: ἀνδρόω)

For the most part, the lemmatiser is correct in dismissing these analyses: ἄντρ’ is not a Demotic analysis, but a Euripidean mention of “cave”, and ἄνδρου (Ἄνδρου) refers only to the island of Andros.

But the lemmatiser is fallible, and it has slipped up with ἄνδρους. (Yes, I’m fixing the analysis now.) The lemmatiser had the alternatives of treating this as a Byzantine attempt at “thou wert manning” (with no augment, so the Byzantines would be play-acting at Homer); or a Demotic accusative of “men”, in the completely wrong declension. The lemmatiser decided the Demotic wrong declension was even more absurd than the Byzantine play-acting. As it happens, it’s wrong, this is a Demotic wrong declension after all (in the notoriously patchy vernacular Historia Imperatorum). So next time the lemmatised search engine will be updated at the TLG, there’ll be 43 simple forms of ἀνήρ after all.

Once more with feeling: don’t take the numbers too seriously. (As if the preceding posts didn’t argue that ad nauseam already.) Just use them to get an order of magnitude sense of what’s going on with Greek.

Lerna Va: Word Form Counts, pruning

By: Nick Nicholas | Post date: 2009-06-11 | Comments: 5 Comments
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Byzantine Greek, Lerna, lexicon, TLG

[Counts in this post have been corrected in Lerna VIc]

So surely, after all the disclaimers in previous posts, I will now tell you how many words there are in Greek?

Oh no. Not at all. Not even close.

Before I alight at the burning question of how many lemmata of Greek (and when), I’m going to spend a good deal of time on how many word forms of Greek. I’ve bandied a count already on these pages, and I’m going to reduce that count, slice by slice, until it represents something more reasonable. Not completely reasonable, but more reasonable.

Recall that we established four concentric corpora. When we extract unique strings from each corpus, we can (and do) do some normalisation of those strings. We delete non-textual material: Jona[tha]n and Jŏnă|thān are both counted as Jŏnă|thān, because those brackets and diacritics don’t change the meaning of the word. For Greek, we also do some basic normalisation of accent: the grave is a positional variant of the acute, and words with two accents are phonological variants of words with one accent. So ἄνθρωπός is counted as the same word form as ἄνθρωπος, and καλὸς is counted as the same word form as καλός. We also reattach hyphenated words (which for some texts is trickier than it should be), and we ignore words which are only fragmentary (as routinely happens in inscriptions and papyri).
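
The accent part of that normalisation can be sketched in a few lines of Python. This is an illustration only, not the TLG's actual code: the function name is made up, and it assumes that when a word carries two accents, the second one is the enclitic-induced accent to drop.

```python
import unicodedata

GRAVE = "\u0300"       # combining grave (varia)
ACUTE = "\u0301"       # combining acute (oxia/tonos)
CIRCUMFLEX = "\u0342"  # combining Greek perispomeni
ACCENTS = {ACUTE, CIRCUMFLEX}


def normalise_accents(word):
    """Turn graves into acutes and keep only the first accent of a word."""
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", word).replace(GRAVE, ACUTE)
    kept = []
    seen_accent = False
    for ch in decomposed:
        if ch in ACCENTS:
            if seen_accent:
                continue  # assumption: a second accent is the enclitic one
            seen_accent = True
        kept.append(ch)
    # Recompose; breathings and other marks are left untouched.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    print(normalise_accents("ἄνθρωπός"))
    print(normalise_accents("καλὸς"))
```

Working on the decomposed (NFD) form means the same few lines cover every vowel and every breathing, rather than needing a lookup table of precomposed characters.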

With that normalisation done, we get the following counts of unique strings in the corpora, for the TLG corpus as of this date.

	Word Instances	Word Forms
TLG + PHI #7	102,005,245	1,861,358
TLG (viii–XVI)	95,475,128	1,567,892
Mostly Pagan (viii–IV)	16,312,159	605,335
Strictly Ancient (viii–iv)	5,464,913	334,428

We can already notice a few things:

The PHI#7 papyri and inscriptions have 7% more word instances, but 19% more word forms; so there’s lots of novel strings in the papyri and inscriptions. Because there’s lots of new lemmata? Sure. But also because there’s lots of misspellings. That’s right, a misspelling counts as a unique string; so we’ll have some sifting ahead of us.
More word instances do not mean proportionally more word forms: most word instances belong to a few very common word forms, and novel word forms follow a law of diminishing returns. Going from Strictly Ancient to Entire TLG multiplies your word count by 18, but it only multiplies your word forms by 4.5. Because 18 times more text means 18 times more occurrences of and, and 18 times more occurrences of the, and only at the bottom of the sieve do you find lots of novel words.
Even factoring all that in, later texts did come up with lots of novel word forms. How many, we’ll see later.
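
The ratios behind these observations can be recomputed directly from the table above. A minimal sketch, with the counts copied from the table and the helper name invented for the purpose:

```python
# (word instances, word forms) per corpus, copied from the table above.
COUNTS = {
    "TLG + PHI #7": (102_005_245, 1_861_358),
    "TLG": (95_475_128, 1_567_892),
    "Mostly Pagan": (16_312_159, 605_335),
    "Strictly Ancient": (5_464_913, 334_428),
}


def growth(smaller, larger):
    """Return (instance ratio, form ratio) when moving from one corpus to a larger one."""
    small_instances, small_forms = COUNTS[smaller]
    large_instances, large_forms = COUNTS[larger]
    return large_instances / small_instances, large_forms / small_forms


if __name__ == "__main__":
    for pair in [("TLG", "TLG + PHI #7"), ("Strictly Ancient", "TLG")]:
        instances, forms = growth(*pair)
        print(f"{pair[0]} -> {pair[1]}: x{instances:.2f} instances, x{forms:.2f} forms")
```

Adding PHI #7 gives about 7% more instances for about 19% more forms, and going from Strictly Ancient to the whole TLG gives somewhere around 17 to 18 times the instances for under 5 times the forms, in line with the rounded figures quoted.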

So, what does 1.6 (or 1.8, or 0.6) million unique strings mean? As we’ll see, not as much as you might think. Let’s take the lemma ἀνήρ, “man”. By this criterion, the TLG has no less than 137 distinct word forms corresponding to ἀνήρ. Pretty impressive, when it should just have 11 forms in any given dialect. This is what it should look like in Attic:

	Sg	Du	Pl
Nom	ἀνήρ	ἄνδρε	ἄνδρες
Gen	ἀνδρός	ἀνδροῖν	ἀνδρῶν
Dat	ἀνδρί	[Like Gen.Du]	ἀνδράσι
Acc	ἄνδρα	[Like Nom.Du]	ἄνδρας
Voc	ἄνερ	[Like Nom.Du]	[Like Nom.Pl]

So how did we get from 11 forms to 137? For one, yes, we have multiple dialects in there. But that’s by no means the main reason; in fact, we’re not even going to get to *that* issue, messy as it is, until the next post in the series. Take a look at the 137, this time resplendent in the Greek Font Society Didot typeface:

See the problem? The “unique strings” are case sensitive. Now, there is a reason why I did that: Greek has capitonyms—words that have different definitions if they are capitalised or not; so Ὅμηρος is “Homer”, but ὅμηρος is “hostage”, and Ἱππίας is “Hippias” while ἱππίας is the adjective “of the equestrian [fem]”. The distinction needs to be made for lemmatisation, but it is not extremely frequent; and for words that aren’t capitonyms, it leads to drastic inflation of word form counts. If we do away with casing in our strings, we get something closer to the spoken (and early written) linguistic reality of Greek. Yes these word forms become more ambiguous, but we’re not left trying to claim that ΑΝΔΡΑΣ, ἄνδρας and Ἄνδρας are different words.

Our 137 forms then go down to 105, and our overall counts tumble as follows:

	Word Forms	Reduced
TLG + PHI #7	1,861,358	1,698,134
TLG (viii–XVI)	1,567,892	1,408,908
Mostly Pagan (viii–IV)	605,335	572,537
Strictly Ancient (viii–iv)	334,428	319,512

That’s not enough though: notice that we’ve eliminated ΑΝΔΡΕΣ, because there’s also a lowercase ανδρες, but we’ve kept ΑΝΔΡΙ, because there is no lowercase ανδρι. But of course, ανδρι is just ἀνδρί shorn of its accents, for whatever reason, and shouldn’t be counted separately. If any word form is missing its stress or breathing, we should ignore it if the same word form occurs with a stress or breathing. That will mangle a couple of enclitics, but we’ll undo that damage in a couple of counts, and at any rate it will affect only a dozen or so word forms.
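
Case folding plus the rule just described (ignore a bare form if the same letters also turn up with an accent or breathing) might be sketched like this. The function names are hypothetical, and stripping every combining mark is a simplification: it also ignores diaereses and iota subscripts, which a real implementation would want to treat separately.

```python
import unicodedata
from collections import defaultdict


def bare_key(word):
    """Lowercase, strip all diacritics, and unify final and medial sigma."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return letters.replace("ς", "σ")


def has_diacritics(word):
    """True if the word carries any combining mark (accent, breathing, ...)."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", word))


def conflate(forms):
    """Fold case, then drop bare forms that also occur with diacritics."""
    groups = defaultdict(set)
    for form in forms:
        folded = unicodedata.normalize("NFC", form.lower())
        groups[bare_key(folded)].add(folded)
    kept = set()
    for members in groups.values():
        marked = {m for m in members if has_diacritics(m)}
        # Keep the accented variants if there are any, else the bare form itself.
        kept |= marked or members
    return kept


if __name__ == "__main__":
    print(sorted(conflate({"ΑΝΔΡΙ", "ἀνδρί", "ΑΝΔΡΑΣ", "Ἄνδρας", "ἄνδρας"})))
```

Grouping on the stripped key is what lets ΑΝΔΡΙ be discarded in favour of ἀνδρί even though no lowercase ανδρι is attested.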

So, conflating ΑΝΔΡΙ and ανδρι to ἀνδρί, and requiring word forms to have breathings and accents, our 105 forms go down to 86, and our overall counts to:

	Word Forms	Reduced
TLG + PHI #7	1,698,134	1,649,083
TLG (viii–XVI)	1,408,908	1,376,016
Mostly Pagan (viii–IV)	572,537	562,744
Strictly Ancient (viii–iv)	319,512	314,887

To go any further in interpreting word forms, we have to associate them to particular morphological analyses and lemmata. That means we should restrict our counts to word forms that the lemmatiser recognises, because we can’t say much reasonable about the word forms that it doesn’t. Right now, with casing intact, the TLG lemmatiser recognises close to 94% of the word forms in the TLG corpus, and 60% of the word forms in PHI #7. That’s sacrificing something (6% and 40% of the word forms respectively), but we can’t talk about word forms that we don’t understand; and a lot of those words won’t be words anyway—there’s incantations and geometrical lines and all sorts of stuff in there.

(Of course, if you talk to me tomorrow, I’ll be throwing out less word forms, because the lemmatiser is constantly being made cleverer.)

Eliminating unrecognised word forms and folding case, as we’ve been doing, gives us:

	Word Forms	Reduced
TLG + PHI #7	1,649,083	1,435,391
TLG (viii–XVI)	1,376,016	1,282,298
Mostly Pagan (viii–IV)	562,744	557,574
Strictly Ancient (viii–iv)	314,887	313,354

Let’s pause here. So far, we’ve normalised case and (somewhat) accentuation, and we’ve constrained our word forms to those the lemmatiser understands. Our overall count has gone from 1.86 to 1.44 million. Our Strictly Ancient count has gone from 334 to 313 thousand—that corpus is overall much better behaved, so there’s less there to clean up. Notice that getting rid of unrecognised word forms makes a huge dent in PHI #7 (the lemmatiser doesn’t like phonetic spellings), but barely a scratch on the Ancient corpora (because Ancient Greek is well documented.)

Now, the lemmatiser does cleaning up of its own when it recognises words.

When it sees an apostrophe, it analyses it by filling in the missing vowel: ’νδρες = ἄνδρες.
When it is confronted with words unrecognisable on their own, it comes up with alternate spellings which can make sense of the word as spelled—that’s how it gets anywhere with phonetically spelled church deeds or papyri. So it understands the monstrous diplomatic spelling δειακαίλἐυονται as διακελεύονται.
What’s a monstrous spelling like that doing in the TLG to begin with? Diplomatically published church deeds. That’s why editors normalise. In fact, for all the chaos in the spelling of PHI #7, a lot of the words do have a bracketed normalisation next to them on the CD, and I’ve used those normalisations rather than the original readings in the counts.
If the accentuation has an acute in the fourth last syllable, or something else absurd like that, it analyses the word as if it were accented more sensibly. So it knows ἤλλοιτριωσθησαν is meant to be ἠλλοιτριώσθησαν, and ἦλπισαν is ἤλπισαν.
Iota adscripts are respelled as iota subscripts. So the lemmatiser treats ἦιδε the same as ᾖδε.
And if a word has undergone crasis, merging two words phonetically, the lemmatiser pries them apart again: κἀνδρῶν is broken up into καὶ ἀνδρῶν, and counted as an instance of ἀνδρῶν.

So the lemmatiser does some normalisation of words: it dismisses what are to it obvious misspellings, and it fills in phonologically missing bits of words. This does not get rid of all potential “misspellings”: a lot of them have been added manually to the lemmatiser as variants in the texts. But these normalisations do need to be taken into account when counting word forms. ανιρ is just a phonetic spelling of ἀνήρ, not a novel word form. ἄνδρ’ is not a distinct word form from ἄνδρα, nor is ’νδρες distinct from ἄνδρες, or κἀνδρῶν distinct from ἀνδρῶν.

With the normalisation the lemmatiser can do on its own, the 86 forms of ἀνήρ go down to 50—getting rid of all crases and apostrophes; and the word counts go to:

	Word Forms	Reduced
TLG + PHI #7	1,435,391	1,352,303
TLG (viii–XVI)	1,282,298	1,232,209
Mostly Pagan (viii–IV)	557,574	539,469
Strictly Ancient (viii–iv)	313,354	301,005

Greek phonology has always featured the nu movable, an /n/ which can occur optionally at the end of some inflections, depending on what phoneme follows it. So “is” is ἐστι before a consonant, and ἐστιν before a vowel—leading to the Classic example of why the Ancients should have spaced their words, ἐστι νοῦς “it’s a mind”, ἐστιν οὖς “it’s an ear” (esti nôːs, estin ôːs).

In other words, this /n/ is a liaison phoneme, and its presence or absence does not make the word distinct. So pairs differing only by a nu movable should not be differentiated as novel word forms (and the lemmatiser knows which /n/s are movable). That takes the 50 forms of ἀνήρ down to 43, and the word counts to:
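
A crude stand-in for that step, for illustration only: the real lemmatiser knows from its morphological analysis which /n/s are movable, whereas this sketch simply guesses that a final ν is movable when it follows ε or ι and the form without it is also attested. The function names are invented.

```python
import unicodedata


def _ends_in_e_or_i(nfd_word):
    """True if the last base letter (ignoring accents) is epsilon or iota."""
    base = [ch for ch in nfd_word if not unicodedata.combining(ch)]
    return bool(base) and base[-1] in "ει"


def collapse_nu_movable(forms):
    """Merge pairs like ἀνδράσι / ἀνδράσιν into the form without the nu.

    Heuristic assumption: a final nu is movable if the nu-less form is also
    in the set and ends in epsilon or iota. This will over- and under-merge
    compared with a lemmatiser that knows the inflection.
    """
    attested = {unicodedata.normalize("NFD", f) for f in forms}
    merged = set()
    for form in attested:
        stem = form[:-1]
        if form.endswith("ν") and stem in attested and _ends_in_e_or_i(stem):
            merged.add(stem)
        else:
            merged.add(form)
    return {unicodedata.normalize("NFC", f) for f in merged}


if __name__ == "__main__":
    print(sorted(collapse_nu_movable({"ἀνδράσι", "ἀνδράσιν", "ἀνδρῶν"})))
```

Here ἀνδράσιν folds into ἀνδράσι, while ἀνδρῶν is left alone because its final ν belongs to the ending.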

	Word Forms	Reduced
TLG + PHI #7	1,352,303	1,307,842
TLG (viii–XVI)	1,232,209	1,189,688
Mostly Pagan (viii–IV)	539,469	519,498
Strictly Ancient (viii–iv)	301,005	289,812

The lemmatiser also recognises some strings of Greek that it knows are not words, but abbreviations (Αν is used to abbreviate ἀνήρ at least once), Greek numerals, or geometric lines. (The corpus does include Archimedes and Euclid, after all.) Excluding such non-words takes us to:

	Word Forms	Reduced
TLG + PHI #7	1,307,842	1,300,717
TLG (viii–XVI)	1,189,688	1,183,120
Mostly Pagan (viii–IV)	519,498	518,321
Strictly Ancient (viii–iv)	289,812	289,275

We could keep going, but we won’t, because going further is going to be a lot more onerous. There are lots of “wrong” spellings in the Byzantine era:

uncertainty about whether to circumflex or acute stems (which count as different word forms here): κῦμα κύμα
uncertainty about whether to have double or single consonants (which is what I’ve been dealing with for the past couple of months): ἁγνόρυτος ἁγνόρρυτος
accents on a wrong but legal syllable of a word (which as far as I can tell, Byzantines did Just For Fun): ἄβυσσος ἀβύσσος.

At a guess, that kind of spelling variation may account for 2% of the word forms of the TLG. But this has already gone on plenty, and the point’s been made: it’s true that there are almost 1.6 million distinct strings as far as the TLG Word Index is concerned, but chop off a quarter of that to get closer to a realistic word form count. And if you limit yourself to just an Ancient Greek corpus, the 1.2 million becomes 500 or 300 thousand word forms.

Is that a lot? Well, no one said that Greek wasn’t a highly inflected language. We’ve already seen at length why being a highly inflected language doesn’t automatically give your culture extra IQ points—it’s what you say, not how many suffixes you use to say it. Still, at a rough guess, this means between 3 and 6 word forms per lemma on average in the Greek corpus: common verbs will have hundreds of word forms corresponding to them, while the Long Tail of lemmata will have only one or two forms represented in a corpus. That’s not bad, but it’s not exceptional even among inflected languages—let alone agglutinative.

Let’s compare Slovenian, which is certainly up there among modern inflected Indo-European languages. Rotovnik et al. used a newspaper corpus comparable to what we’re talking about here, and a dictionary of 60,000 lemmata. Now, the thing about lemmata we will see in future episodes is, you never stop counting them: lemma counts are open-ended. All you can do is say, if I know this many lemmata, I can recognise this percentage of word forms in a corpus. So:

	Ancient + Byzantine Greek (TLG)	Contemporary Slovenian
Word instances	95 million	105 million
Word forms	1.2 million	660,000
Lemma count	say 205,000?	60,000
Unrecognised word forms	6.2%	8.7%
Avg. word forms per 1000 word instances	12.6	6.3
Avg. word forms per lemma	6	10

No, this is not a race, and we’re not going to call Slovenian better or worse than Atticist Greek. Nor am I going to go into the sophistication of Rotovnik et al.’s word recognition model, which uses sub-words to improve recognition—and goes from 8.7% unrecognised down to 1.2%. I’m already doing some less sophisticated tricks to get as far down as 6.2%, because the TLG corpus is much messier than the corpus of Večer articles. No, the point is that a language like contemporary Slovenian, without three thousand years’ and six dialects’ worth of weighing down the scales, gives you the same order of magnitude of morphological diversity as do the Three Thousand Years of Greek.

And of course, Three Thousand Years of Greek may have double the word forms of five years of Večer; but once you go to an agglutinative language, Greek’s out of the running, because agglutinative languages pack a lot more into their words. I can’t get a lemma count from Kamadev Bhanuprasad’s study on speech recognition in Telugu; but his newspaper corpus has 20 million word instances, and 615,000 different word forms: 30.8 word forms per thousand word instances, to the TLG’s 12.6. Which tells us what we already knew: Greek is not the most morphologically productive language on the planet.
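
The forms-per-thousand-instances figures used in this comparison are simple to reproduce. A small sketch using the rounded counts quoted in the text (the dictionary labels and function name are mine):

```python
# (word forms, word instances), using the rounded figures quoted in this post.
CORPORA = {
    "Greek (TLG)": (1_200_000, 95_000_000),
    "Slovenian (Večer)": (660_000, 105_000_000),
    "Telugu (newspaper)": (615_000, 20_000_000),
}


def forms_per_thousand(forms, instances):
    """Distinct word forms per 1000 running words of the corpus."""
    return forms * 1000 / instances


if __name__ == "__main__":
    for name, (forms, instances) in CORPORA.items():
        print(f"{name}: {forms_per_thousand(forms, instances):.2f}")
```

That gives roughly 12.6 for the TLG, 6.3 for Slovenian, and just under 31 for Telugu. The measure depends heavily on corpus size, since novel forms taper off as a corpus grows, so comparing a 20 million word corpus against a 95 million word one if anything understates the gap.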

We’ve cut the word form count down for the Greek corpus to something more realistic; but “Realistic” is a problematic thing to say anyway, because we’ve still got to explain how 11 forms of ἀνήρ got to 43. That’s the story about the dialectal and diachronic diversity of the corpus, and it will have to wait for the next instalment.

Lerna IV: Corpora

By: Nick Nicholas | Post date: 2009-06-10 | Comments: No Comments
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Byzantine Greek, Lerna, TLG

So having spent four posts on why we should not count words of Greek, I will count words of Greek. The counts are only meaningful relative to a corpus, so here I detail what’s in the corpus I’ll be using, PHI #7 + TLG—and how I will end up treating it as four concentric corpora. There is also some information on the distribution and coverage of the TLG, which may be of interest even if you’re not interested in counting words.

The corpus I’m using consists of a group of texts I’ve come to know well, the TLG; and a group of texts I know less well, the PHI #7 disc. The Thesaurus Linguae Graecae is a digital library of Ancient and Byzantine texts, which has been steadily moving forwards in time: it’s increasing by around 3 million words a year. Counting words of data-entered text, including markup, hyphenated fragments, and symbols, it currently has close to 105 million word instances; if we restrict the count to just words of Greek, it has 95 million words. (That’s your first indication that counting words is complicated.)

The TLG doesn’t have all the text there is, but it does have a lot, and it’s filling in texts as it goes. I’ll reuse the grid I used for dictionary coverage of Greek to show how:

This is a crude representation of the current coverage of the TLG:

The TLG does not cover ancient non-literary texts, which are attested in papyri and inscriptions. That matters for counting words, because a lot of lemmata are only attested in non-literary sources. Non-literary sources range over details of daily life (especially in the papyri), and dialects absent from the literary canon (in the inscriptions). Non-literary texts are where texts keep showing up from antiquity, and where both the LSJ Supplement and the Diccionario Griego-Español get many of their new lemmata from.

This area is not covered by the TLG, with only a couple of exceptions (Epistulae Privatae); to address that, I’m also including the PHI #7 disc from the Packard Humanities Institute in the corpus.

The PHI #7 disc, which has 6.5 million words of Greek, was issued in 1995, and it includes three collections: a corpus of ancient inscriptions [now online], compiled by Cornell and Ohio State Universities (3.1 million); the Duke Databank of Documentary papyri (3.1 million) [also online]; and Inscriptions of the Christian Empire, compiled by John Mansfield of Cornell (0.3 million). New inscriptions and papyri keep showing up all the time, so PHI #7 does not cover everything we know we have; but it is representative enough.

The TLG does admit non-literary texts for the mediaeval period. The monastic acts in particular are diplomatic editions (i.e. preserving the original spelling of these monastery legal documents in all their inventive confusion). Their misspellings cause our counts of word forms all manner of trouble, as we’ll see. But the TLG has not been working on mediaeval inscriptions either, so PHI #7’s Christian inscriptions fill in a gap as well. The Christian inscription corpus is small, but it covers a lot of ground: the proto-Bulgarian inscriptions are here, and the texts go late enough to include Χακῆ as a rendering of Hajji.

For literary texts, the TLG is pretty much complete for antiquity, strictly defined. There are some gaps for the early Christian era, which are currently being filled in: the TLG is still missing some apocrypha, liturgical texts, and the Hexapla, including the Hebrew Scripture translations of Aquila and Symmachus. In terms of raw lemma count, the Latin–Greek glossaries are not in yet; when they are added to the TLG, they will account for something like a thousand LSJ lemmata currently absent from the corpus.

That an ancient dictionary should have so many one-off lemmata in it is no surprise: dictionaries contain words that people didn’t know, and which are unlikely to turn up anywhere else. Which is why Hesychius is so important to comparative linguistics—and such a pain to do anything sensible with in lemmatisation. The Latin–Greek glossaries are a mixed bag: they contain words never heard of before or again (e.g. τηκεδονικός, -ή, -όν tabificabile); but they also have the first instances of common modern words (e.g. τζάπιον, τό bidens, ligo, raster—i.e. Modern τσάπα “hoe”).

For the mediaeval period proper, TLG work on expanding the corpus is ongoing. We guessed a couple of years ago that we had 70% of the texts covered by Trapp’s dictionary: that would translate to some 20 million more words. The TLG has now started to include Early Modern texts as well: it has a while to go yet, but it already has 2.5 million words of the vernacular. Of course, this is (a) only a small proportion of Early Modern Greek texts, and (b) nothing about the contemporary Modern Greek language. So this corpus doesn’t tell you much about anything past 1600 at the moment.

We saw in the post on the dictionary coverage of Greek that various periods do better or worse in how well they are covered by grammars and dictionaries—and how “clean” their texts are. (Way too much Migne still for editions of the Church Fathers, for one.) That’s reflected in the lemmatiser I’ve been working on for the TLG: it deals with Ancient Greek proper exceedingly well (99.4% recognition up to Aristotle), but more patchily with Mediaeval Greek (94.6% for viii-xvi AD learnèd, as of May). These can be illustrated with degrees of certainty of recognition by the TLG lemmatiser, something I’ll talk more about later. (And note that these figures change month by month, as the lemmatiser is improved.)

Lemma recognition is at a loss with the PHI #7 texts (around 65%). In large part, that’s because of the more chaotic spelling used in those texts. In at least some part, that’s because I’ve spent 6 years tweaking the lemmatiser to TLG texts, and only a couple of hours tweaking it to PHI. I’m missing a whole lot of inscription- and papyrus-specific lemmata from DGE (where the growth spurt is), and there’s a whole lot of Egyptian proper names the lemmatiser hasn’t heard of, so PHI is going to be underrepresented in any lemma counts I try to work out.

Moreover, we saw in more recent posts that the longer the time span of a corpus, the more incoherent any counts are. Given that later Greek is less well documented, less well edited, and less Classical than earlier Greek, I’m going to split my corpus in four, and give counts for each, moving progressively closer to the Ancient core. So:

Counts for TLG + PHI #7.
Counts for just TLG, which I’ve had more command over than the PHI #7 corpus (and which concentrates us on literary texts for antiquity)
A Mostly Pagan Mostly Ancient corpus.
A strictly Ancient corpus.

The strictly Ancient corpus stops with Aristotle, fourth century BC. That covers the classical canon, which everyone since has admired and emulated; but it’s not all the texts counted as ancient in the broad sense: it leaves out Polybius, Plutarch, Lucian—and the Judaeo-Christian scriptures. Antiquity conventionally ends with Nonnus, in the sixth century AD. But having an ancient corpus go up to the sixth century will include too many “unruly” texts: texts in poor editions, or texts where the classical norms aren’t as consistently observed.

To clean up the corpus somewhat, and present a middle ground between the full TLG and Homer-through–Aristotle, I’m positing a Mostly Pagan Mostly Ancient subcorpus. This goes up to the fourth century AD (so Synesius is in, Nonnus is out), and it includes the Jewish and Christian scriptures. But it excludes any other Christian writings, and technical writing: medical, legal, alchemy, astrology, lexicography, grammar, scholiastic, philology, geography, mathematics, mechanics [engineering], and magical. That’s pretty brutal, but both the technical and the Patristic corpora are linguistically distinct from the literature of Lucian and Synesius, and are the kinds of text that Classicists, for better or worse, have paid less attention to. So this Mostly Pagan Mostly Ancient corpus is a literary corpus, comparable to the strict Homer-to-Aristotle grouping, but with a less straitened timespan.

Limiting the time span like that cuts down the 95 million word corpus significantly, because of how unequally texts are attested from different periods. The strictly Ancient corpus is just 5 million words large; and there are some striking disparities in how texts are represented in the TLG by century:

Obligatory provisos about the century breakdown: it’s by author not work, so a small number of later texts get included in earlier centuries. The most egregious instance is in the Hippocratic corpus, which includes among its Ionic a text so modern, it uses Italian words for “virtue” and “colour” (βερτοῦ, κλόρε). The “Varia” are mostly scholia, which cover any time from Roman times to the late Middle Ages. But the proportions are indicative enough.

The inconsistencies will be clearer in bar chart form:

Most of the spikes in the graph can be explained. The iv AD spike is the major church fathers, and texts attributed to them—which make John Chrysostom the most prolific author in the corpus. The ii AD spike is in large measure because of the disproportionate representation of medical authors (and the Second Sophistic), and texts attributed to *them*—which make Galen the second most prolific author in the corpus. The dip in vii–viii AD is presumably the Byzantine Dark Ages (yes, yes, I know the term is problematic). The dips in other centuries, especially ii BC and iii AD, I don’t really have an explanation for.

The disproportionate spikes go away if we take Christian and technical texts out of the equation, and restrict ourselves to literature (à la the Mostly Pagan Mostly Ancient subcorpus, which adds up to 19 million words).

The Golden Age of Classical Literature does not look so underwhelming, for one:

There’s still some spikes that may or may not come as a surprise. vi AD is bolstered by the voluminous Neoplatonists; even without the medicos, the Second Sophistic was prolific; the Comnenan Renaissance and Palaeologan Renaissance, xi AD and xiii AD, are now visible. And once the Byzantine legal texts are taken out of the picture, the Dark Ages look Darker: they weren’t as dark a time for lawyers…

Lerna IIId: Why we do not count lemmata

By: Nick Nicholas | Post date: 2009-06-05 | Comments: 5 Comments
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Lerna, lexicon

Now, the whole point of any word counting venture, such as Lerna attempts and gets galumphingly wrong, is not the corpus size, which is contingent and always less than infinity; nor is it the number of word forms, which tells you about morphological happenstance but not about vocabularies. When people talk about words, they mean dictionary words.

This veers off into Eskimo Words For Snow territory, so it’s even more fraught for a linguist to talk about. Especially because, even more than for word forms, there is a lot of arbitrariness to be had about how you count lemmata. Enough arbitrariness to make the whole venture deeply problematic. It’s especially problematic if, like the artificially inflated corpus of the TLG (or the OED, or indeed any dictionary), the corpus spans more than the vocabulary contained in one skull, and ranges over more than one region, and more than one decade. That brings together all the words you might need to know if you ever come across them, in a literate culture that preserves words in print for centuries. It does not bring together all the words you ever will have in your skull: it’s not modelling the vocabulary that any speaker will ever command. Dictionaries are documenting an inflation inherent in any written language; it is particularly pronounced for Greek, for reasons already seen.

Now, it’s reasonable to assume that if your language gets used by more people, to talk about more stuff, in a culture where more stuff is around, and in contact with lots of other languages and their speakers’ stuff, then that language will have more words. The Greek of the Roman Empire was like that. The English of the Globalisation Empire is much more like that. So if the guesstimates are that contemporary English has twice as many dictionary words as contemporary Spanish, that’s plausible.

The Greek of the Classical Age invented much of how the West understands the world. But it was not exploding with words. The Spartans weren’t the only Greeks to be Laconic: Classical Greek was frugal with its words—enough for its philosophy to look basic (or unsophisticated), compared to the German experience. As we’ll see, the vocabulary explosion happened much later. Look at how Plato writes about philosophy, how a speech in Euripides works—how insistently Aristophanes snipes at Socrates’ and Euripides’ new-fangled words, and how unremarkable those new-fangled words turn out to be. “Verse” στίχος, Frogs 1239, was such a new-fangled word, for goodness’ sake, as Andreas Willi writes: The Languages of Aristophanes, p. 58; yet it’s merely reusing the word for “line”.

Plausibility was never the point of the Lernaean text, nor is it perturbed by any actual familiarity with Classical Greek. But even with three millennia of vocabulary buildup pitted against 500 years’ worth of Modern English, the world is working out in such a way that Greek is not going to beat English in the “my lexicon is bigger than your lexicon” games. The information overload explosion is being engineered in English, and involves English coinings. Where the vocabularies are growing, other languages are struggling to keep up, and most don’t bother: IT done outside of English is now all about the codeswitching. Lernaeanists hear the codeswitching and see the scriptswitching all around them, yet still they assert in their Letter to the Editor that English having more words than Greek must be some kind of joke. (“Και προβάλλεται ως τέτοια η Αγγλική, που μόνο σαν ανέκδοτο μπορεί να θεωρηθεί.”) That… must be some kind of joke itself.

But as the about.com answerer hastened to add, English having double the words of Spanish doesn’t mean Spanish doesn’t have nuances which English can’t readily express. Or that any other language doesn’t. There are still notions particular to any given culture, which that culture’s vehicle language will have words for, and another culture’s language won’t have had a reason to come up with a word for. That’s true of farm implements vs. modem protocols, and it’s true of all the subtle constructs that each language’s poets embrace zealously, and that the Meaning of Tingo book series did such a superficial job on. (At least the guy has a blog, so there’s some avenue for the readership to fine tune things.)

It always struck me as amusing, btw, that most such “untranslatable” Modern Greek words… are Turkish or Venetian. Although of course, whatever meaning they’ve since picked up is quite distinct from when they first entered the language. It’s a long way from merak “hypochondria” to μεράκι “outburst of creativity” [EDIT: better, “sustained creative effort”]. The sequence, from what I surmise, is: hypochondria > lovesick > yearning > fastidious about one’s work > taking pride in one’s work. By a similar pathway with a last-minute detour, meraklı “hypochondriac” > μερακλής “bon vivant, connoisseur”… Come to think of it, those French words are untranslatable too, aren’t they.

There’s likely more animal husbandry terms in Masai than Pitjantjatjara, and more terms for kinds of cheese in Italian than Laotian, and more terms for intellectual property arrangements in English than in Sorbian. That’s the anodyne version of the Eskimo Words For Snow business, and not particularly surprising. Again, it doesn’t mean brains are wired differently. You can translate μεράκι with some work. Unlike Spanish (and Old English), German does not have a verb to distinguish essential being from contingent being (ser/estar, bēon/wesan). That didn’t put the brakes on German philosophy (!), and it didn’t prevent them making a noun for persistent existence, Dasein. Not having as many words as the language up the road is not such a deal-breaker in the end.

But as to this urge to have more words than English, in a game that can’t be won and makes no sense anyway… it’s malicious of me, but I’m compelled to recount the 1980 Richard Feynman in Greece episode (Link 1, Link 2):

They were very upset when I said the development of the greatest importance to mathematics in Europe was the discovery by Tartaglia that you can solve a cubic equation: although it is of little use in itself, the discovery must have been psychologically wonderful. It therefore helped in the Renaissance, which was freeing man from the intimidation of the ancients. What the Greeks are learning in school is to be intimidated into thinking they have fallen so far below their ancestors.

Tartaglia’s work was done more than 1000 years after the Greeks and showed to the Greeks that a modern man could do something no ancient Greeks could do (Richard Feynman, What Do You Care What Other People Think?)

Lerna is a hoax, and Lerna is an annoyance, and Lerna is an embarrassment; but it will not die, because more than anything else, Lerna is a symptom. It’s a symptom of what Feynman found. And the way to singe the head of the Hydra is to get over that nagging sense of not measuring up to the Hellenes. Generations have failed to make headway there; but Lerna’s not making the job any easier, by attributing to a literature already bestriding the world a vocabulary 1000 times larger than life.

Lerna IIIc: Why the Greek scales are rigged

By: Nick Nicholas | Post date: 2009-06-05 | Comments: No Comments
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, historical linguistics, Lerna

Even if you allow for the fact that Greek is flexional and has lots of inflections, a literary corpus of Greek is going to have a lot more morphological variety than most other literary languages. That doesn’t tell you something about the superiority of the Greek language. But it does tell you a bit about Greek culture. And it does mean that, if the word form and lemma counts of Greek come out better than expected, the comparison is not exactly fair.

The first catch is that the literary corpus spans three thousand years, as many a Greek ideologue likes to remind you: a trick only Chinese has gotten away with. Does that prove it’s the same language? That’s a loaded question, of course: if you believe the Moderns are the same people as the Ancients, you’ll call both Greek, as everyone does now; and if you don’t, you’ll distinguish Hellenic from Romeic, as everyone did three centuries ago. (That’s unless you were calling Romeic Graecobarbaric, which was also all the rage in some circles.) More to the point, if you believe the Moderns are the same people as the Ancients, your language will reflect that belief. A lot of that in the contemporary Standard is engineered: it results from the conscious efforts of Puristic, to bring older forms of the language back. Some of it is older conservative forces, notably the language of the church.

Greek is no Icelandic: the written literary tradition has had much more of an effect on the spoken language up North, and Iceland is a much smaller place. Greek may be on the conservative side morphologically, compared to say English; but the morphology has still changed quite a bit. Which means, if you count Homeric morphology and contemporary morphology in the same word form count, you’re going to get a lot more word forms than if you were doing one millennium at a time. And most counts of what a language’s word forms are take just a decade or so at a time, because most counts are synchronic: they’re snapshots of a language, not the whole Theseus’ Boat ten-part series. A synchronic count of Greek is going to show you a lot less variation, because people don’t normally have conversational command of three millennia’s worth of speech.

Normally, no one does: that’s not the language people have in their skulls, which is what most linguists deal with. Of course, you could compile a corpus of three millennia’s worth of language spoken in Rome; and you’d get Classical Latin, Vulgar Latin, several stages of Romanesco, and Standard Italian in the one list. With lots and lots of morphological variation. There’s a reason why you wouldn’t call that one language’s worth of morphology over three millennia in Rome, but three: so the different morphology shouldn’t be on the same listing. There’s a reason why you may choose to call it one language’s worth of morphology over three millennia in Athens (as long as you leave out the Arvanitika of Pllaka). The reasons for that aren’t entirely linguistic. They aren’t entirely non-linguistic, and the development of Greek has been affected by the underlying thinking. But these are all gradients and slippery slopes; and Greek is at one extreme of the slope. It proves Greek covers a long period; it doesn’t prove Greek-speakers have their brains wired differently.

There’s not just the three millennia upping the word count. All languages have regional variation, with different grammar and lexicon, up until they get spoken in just one place—or the mass media convince you that they are. People normally speak one dialect at a time, just like they speak one century at a time; so having a corpus span 3000 km of language doesn’t tell you more about what language is contained in a single skull than does having a corpus span 3000 years of language. So including 3000 km bumps up your word form count more than is strictly speaking fair.
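For anyone who likes to see the bookkeeping, here is a minimal sketch of why pooling slices inflates a word form count. Everything in it is a toy: the slice labels and form names are invented placeholders, not corpus data, and the slices could just as well stand for centuries as for regions.

```python
# Toy illustration only: slice labels and form names are invented
# placeholders, not real Greek forms or real corpus figures.
forms_by_slice = {
    "slice_1": {"form_a", "form_b", "form_c"},
    "slice_2": {"form_b", "form_c", "form_d"},
    "slice_3": {"form_c", "form_d", "form_e", "form_f"},
}


def pooled_count(slices):
    """Distinct forms when every slice is thrown into one list."""
    return len(set().union(*slices.values()))


def largest_slice_count(slices):
    """Distinct forms in the single richest slice (one skull's worth)."""
    return max(len(forms) for forms in slices.values())


pooled = pooled_count(forms_by_slice)
largest = largest_slice_count(forms_by_slice)
print(pooled, largest)
```

The pooled figure can never be smaller than the figure for any one slice, and it grows with every slice that contributes a form the others lack. That is all the "unfairness" amounts to.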

The thing about Greek is, the literary culture made the same language span not just thousands of years, but thousands of kilometres. Literary Greek is pretty distinctive in having no less than six literary dialects: Epic, which is Old Ionic with other bits, New Ionic, Doric, Aeolic, Attic, and Koine. They’re conventionalised in the literary texts, and are not always linguistically reliable; but someone with a literary grounding by Hellenistic times was expected to be conversant in the lot of them; and the literary corpus does need to reflect them all. The literary corpus is not reflecting what was in any one Greek’s skull as their native speech, so comparing its morphological diversity to what other language corpora tell you is artificial. But once a language is literary, artifice happens: there’s more King James and Shakespeare in contemporary English than there should be, too, and more pepperings of American in Australian English than would have made sense a century ago. And at least some Byzantine scholars did have some command of much of this inflated repertoire of Greek morphology, as artificial as it got.

All this though is reason why counting word forms in Greek is misleading. I’m still going to attempt it, because it raises some further interesting questions, and we’re going to see the 1.5 million word forms I quoted whittled down a fair bit. (Having to control for spelling variation, for starters.) Last stop in the ritual abjuring of grocery calculations, lemmata.

Lerna IIIb: Why we do not count word forms

By: Nick Nicholas | Post date: 2009-06-05 | Comments: 1 Comment
Posted in categories: Ancient Greek, Linguistics, Mediaeval Greek
Tags: Ancient Greek, Lerna, morphology

Greek is a flexional language: it’s not English. A single noun can have 11 different inflections. A single adjective can have 23 inflections. A single verb? I’ll throw in the second aorist as well as the first, though I really shouldn’t—verbs mostly had just one aorist at a time. I’ll be generous, we’ll call it 740 forms.

Many a student has gazed in wonder at the subtlety and copiousness of the Greek verb table. I’m sure about as many have been annoyed at the rote memorisation; but the reason the verb table gets admiring remarks is, the 740 forms are not random: they follow a mesh of patterns, which you can reconstruct in proto-Greek back to something pretty neatly agglutinative. On the other hand, a few centuries of phonological shuffling reconfigured the 740 forms enough to be interesting.

Of course, very few verbs are attested in a corpus with all 740 forms. Few verbs have both first and second aorists, to start with. And any corpus is going to display only a subset of what is possible in a language, and what language speakers will recognise as valid verbs. To our knowledge, πεπαίκοιτον “you two would have played” is not attested anywhere in Greek. But it’s a regular perfect dual optative, and the perfect indicative πέπαικα is well attested enough: it’s as valid a verb form of Greek as any other, whether anyone ever wrote it—indeed, whether anyone ever spoke it, or not. So though the TLG happens to have 219 forms of παίζω, all 534 possible forms of παίζω should count. (No second aorist.)
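As a quick worked figure, using only the two counts quoted above for παίζω (the variable names are mine):

```python
# Both numbers come from the paragraph above; nothing else is assumed.
attested_forms = 219   # forms of παίζω the TLG happens to contain
possible_forms = 534   # forms the paradigm allows (no second aorist)

coverage = attested_forms / possible_forms
unattested = possible_forms - attested_forms

print(f"{coverage:.0%} attested, {unattested} forms valid but unseen")
```

So well over half of the paradigm is valid Greek that the corpus never happens to show.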

But once we admit all possible forms, and aren’t constrained by what’s in a corpus, we’re comparing langues, not paroles. And there are languages with more morphology than Ancient Greek. Finnish has fifteen cases. Sanskrit has comfortably over a thousand verb forms. Agglutinative languages, which don’t moosh affixes together into idiosyncratic combinations, can go a lot further than that. Turkish? OVER TWO MILLION verb forms.
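A rough sketch of why agglutinative counts run away like that: if each affix slot combines freely with the others, the number of forms is the product of the options per slot. The slot names and numbers below are invented for illustration; they are not a description of Turkish or of any real language, and real languages restrict which combinations are allowed.

```python
from math import prod

# Hypothetical slot inventory: invented numbers, not real Turkish morphology.
slot_options = {
    "voice": 4,
    "negation": 2,
    "tense_aspect": 9,
    "mood": 5,
    "person_number": 6,
}


def count_forms(slots):
    """Forms available if every slot combines freely with every other."""
    return prod(slots.values())


total = count_forms(slot_options)
# Adding one more three-way slot triples the total.
with_extra_slot = count_forms({**slot_options, "extra_slot": 3})
print(total, with_extra_slot)
```

Each additional slot multiplies the total rather than adding to it, which is how a verb table gets from the hundreds into the millions.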

So is Ancient Greek the only language with an interesting verb table? No, Sanskrit beats it. Is it the only language with lots of morphology? No, Turkish beats it, and Lakhota beats it, and Telugu beats it.

And does that prove Greek inferior to Sanskrit, or Turkish, or Lakhota? No, because that’s no valid criterion for judging a language’s merit. And the reason Greek should bail out of this comparison is not that its 740 lose out to Turkish’s TWO MILLION, but that this particular flavour of grocers’ calculation doesn’t prove much of anything. Just as the 98 inflections of Modern Greek, or the four inflections of Modern English, don’t prove their inferiority to the 740 of Ancient Greek.

And really, why would they? There’s poetry in Chinese, and poetry in Lakhota; there’s oratory in Latin, and oratory in Arabic. Is a culture lesser for lack of a dative? Μὴ γένοιτο. Is it impoverished through absence of an optative? Ας σοβαρετούμε λίγο. In fact, just as Hellenomaniacs ponder whether you really are impoverished for lack of a dative, English-speakers in different venues—though no more scholarly—ponder whether you are impoverished for having one. Both can’t be right, and really, do we want to say that either is right? That’s a dodgy calculus to embark on.

Now, I have to swap hats from linguistician to literato for a minute, because the Hellenomaniacs do ponder the loss of the dative for a reason. “I prefer the synthetic nature of Ancient Greek to the analytical nature of Modern Greek”, one of the Sarantakos bloggers posted. With a linguistician’s hat on, that’s sentimental claptrap. But language is about a lot of things, including sentimental claptrap. It’s a vehicle for people’s ideologies, and it gets affected by those ideologies.

The dative ain’t coming back in Greek—been through that. But Puristic has had major impact on the Modern Standard, even if it didn’t augment its inflection count. And just because Puristic Greek failed to revive the dative doesn’t mean standard languages can’t choose to switch their typology, through deliberate acts of engineering. Estonian even changed its word order because of one language reformer. Are there linguistic reasons to do so? Not really, the languages were trundling along fine without the engineering. I mean, language typologies left on their own do change: they’ve got more analytical for the European languages we’re familiar with, but less analytical for Chinese. So it can happen. But it doesn’t have to, and natural ebb and flow of language structures is not why the language engineering happens. It’s “sentimental claptrap” that does it. If a language community is convinced to do something about its morphology, that has linguistic consequences, so it’s not something alien to linguistics.

That aside, you can have aesthetic judgements about how a language works. If you got Classical Greek under humane conditions in your schooling, you can look at a phrase in Lucian and say, “that’s elegant”. The datives and the optatives are part of that elegance. I’ve even thought “that’s elegant” about George Chatzidakis’ Puristic Greek. At the sentence level; because like most 19th century linguists, he was incapable of structuring an argument, and Chatzidakis’ elegant sentences add up to fifty pages of “and another thing” meanderings that can only be broached via a subject index.

Aesthetics matters; but aesthetics is informed by many a factor, not all of them linguistic. Modern Greek speakers have been attracted to the dative, which they don’t have and wish they did, like the Ancients; and they’ve been repelled by it, after being badgered that they should have it. It’s emotive either way because the Ancients are part of the equation. The Turkish ablative is as elegant, in purely linguistic terms, as the Latin one; but those who have longed for the Greek dative aren’t on record admiring Turkish sentence structure.

There’s no shame in aesthetic judgement being culturally informed. That’s the nature of aesthetics. But that tells you that the aesthetics are not mathematically provable, certainly not through a word form count. People used to want to nudge English in the direction of Latin, back when they too were burdened with its heritage. They’re chill about it now. Which means the few English-speakers who read Latin can appreciate it without the gnawing feeling they should emulate it—whether the emulation makes sense in English or not. (See infinitive, split.)

I don’t say this to dismiss the learning of Ancient Greek in Greece. I’m not even saying there aren’t things Modern Greek style can emulate from Ancient Greek: it has done, just as Gibbon owed a debt to Cicero. But none of that is inherent in the dative case. And at the end of the day, πεπαίκοιτον is more compact than “you two would have played”, but is it “better”? Objectively? Without considering which civilisation the word was at home to? And if it is, is Upper Sorbian byštej zahrałoj “you two would have played” any less “better”? How? How about Nenets manzarajidinz’ “you two would have worked”?

Right. More grocery calculations coming up.
