Comparison, TLG BC and AD

By: | Post date: 2010-02-01 | Comments: 6 Comments
Posted in categories: Ancient Greek, Linguistics

In the previous post, I used Wordle to illustrate stop words in Greek (and, by the by, the power-law distribution of function words following Zipf’s Law). After getting rid of a whole bunch of stop words, I ended up with a Wordle of the lemmata of the TLG:

But I stopped short of making sense of the Wordle, because the TLG contains both Ancient and Mediaeval texts, and they talk about different things. I promised Wordles of the texts in the TLG from BC and AD, which will give at least a rough sense of the difference.

So here they are:


Images created by the Wordle.net web application are licensed under a Creative Commons Attribution 3.0 United States License.

The Wordle images are hyperlinked to the Wordle applets hosted there, so you can play with the applets by eliminating words. The stopwords are as before, but I also got rid of πολύς “much”, which was crowding the BC texts a bit much.

A few things jump out quickly: there’s a lot more God AD, as you’d expect (θεός), slightly more talk of “people” than of “men” (ἄνθρωπος, ἀνήρ), less talk of the City and more talk of power (πόλις, δύναμις).

But I’m not really a visual person, so I’m going to use more quantitative ways of working out the changes in vocabulary.

To begin with, the two Wordles show the 150 most frequent lemmata for each period, not counting stop words. These are the differences between the two—words in the top 150 of one period, but not the other.

Ancients talked more about…

Ἕλλην, Ἀθηναῖος, Ζεύς, ἀμφότερος, διαφέρω, ἑκάτερος, ἐλάσσων, εἶμι, εὖ, ἥλιος, ἡγέομαι, ἱερός, κεῖμαι, κύκλος, ναῦς, νέος, νομίζω, ὀρθός, οἶκος, πλέως, πλεῖστος, πλῆθος, πόλεμος, πολέμιος, ποταμός, θάλασσα, θεά, σημεῖον, ταχύς, ὕστερος, χρῆμα, χώρα, ζῷον

Greek, Athenian, Zeus, both, to differ, either, less, go, good, sun, to lead, holy, to lie, circle, ship, new, to think, right, house, full, most, crowd, war, enemy, river, sea, goddess, point, fast, last, need, land, animal

and less about…

Χριστός, ἅγιος, ἁπλόος, ἄξιος, ἀδελφός, ἀλήθεια, βασιλεία, δέχομαι, δηλόω, δόξα, ἐκκλησία, ἐνέργεια, εἶδος, φωνή, κίνησις, κόσμος, νόος, οἰκεῖος, οὐρανός, οὐσία, πάθος, πίστις, πνεῦμα, πρόσωπον, θάνατος, θεῖος, σάρξ, τέλος, τρίτος, χάρις, ζητέω, ζωή

Christ, holy, simple, worthy, brother, truth, kingdom, to accept, to declare, glory, church, activity, form, voice, movement, world, mind, own, heaven, substance, passion, faith, spirit, face, death, divine, flesh, end, third, grace, to ask, life

The effect of Christianity on vocabulary use is pretty obvious. A few other changes are worth noting:

  • Byzantines nominalised a lot more than Ancients did. That’s at least some of the reason for ἀλήθεια “truth” (instead of the more Attic τὸ ἀληθές “the true”), and it may relate to other nominalisations like κίνησις “movement” and ἐνέργεια “activity”. (βασιλεία “kingdom” has a Biblical pedigree, but that is also because the Bible was not written in Attic.)
  • Many of the differences are a matter of language change, rather than different ideology. For all that most Byzantines did not write in the vernacular, their language was usually more akin to Koine than to Attic. That explains the absence of εἶμι, εὖ, ναῦς, πλέως, ἐλάσσων, ἱερός, πολέμιος (replaced by στέλλω, καλός, πλοῖον, πλήρης, μικρότερος/ὀλιγότερος, ἅγιος, ἐχθρός) “send, good, ship, full, less, holy, enemy”, and presumably also the avoidance of ἀμφότερος and ἑκάτερος “both, either”.

I’ve left out from those lists words that show up in the top 150 only because they’re ambiguous with other legitimate words. (Yes, I should have pruned the Wordles.)

  • BC: δίκαιον, δοκεύς, ἠώς, θέα: rights, beam, dawn, view
  • AD: ἅγιον, βασίλειος, ἴδιον, κενόω, πρόσωπος, ζωός: sanctuary, royal, particularity, make void, face, alive
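The top-150 comparison above boils down to a set difference. Here is a minimal sketch in Python, assuming the per-period lemma counts are available as plain dictionaries. The names `bc_counts`, `ad_counts` and `exclude` are hypothetical and the counts are toy numbers; none of this is the TLG's actual data or tooling.

```python
def top_lemmata(counts, n=150, exclude=frozenset()):
    """Return the n most frequent lemmata, skipping anything in `exclude`
    (stop words, or lemmata that are only there through ambiguity)."""
    # Sort by descending count; break ties alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda lemma: (-counts[lemma], lemma))
    kept = [lemma for lemma in ordered if lemma not in exclude]
    return kept[:n]


def exclusive_to_each(bc_counts, ad_counts, n=150, exclude=frozenset()):
    """Lemmata in the top n of one period but not in the top n of the other."""
    bc_top = set(top_lemmata(bc_counts, n, exclude))
    ad_top = set(top_lemmata(ad_counts, n, exclude))
    return sorted(bc_top - ad_top), sorted(ad_top - bc_top)


# Toy counts (invented for illustration only).
bc_counts = {"καί": 100, "πόλις": 9, "θεός": 8, "ναῦς": 7}
ad_counts = {"καί": 100, "θεός": 9, "Χριστός": 8, "πόλις": 1}

bc_only, ad_only = exclusive_to_each(bc_counts, ad_counts, n=2, exclude={"καί"})
print(bc_only, ad_only)
```

Note that the exclusions are applied before the cut-off, so each period still gets a full top n of content words.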

There’s one further comparison I’ll attempt: the words whose frequency changed the most between the two periods. To track this, I’m going to use the 2000 most frequent lemmata for each period—including both normal words and stop words; that constraint means we’re only looking at words that are likely to matter. I’ll go through the lemmata in those lists whose ranking changed by the greatest amount (e.g. from #1537 to #10342).

Because it’s a pretty heterogeneous list, and different kinds of words tell us different things, I’ll split them up into categories. (And I will do some silent suppressing of ill-recognised ambiguous words.)
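For the curious, the rank-shift measure can be sketched in a few lines of Python. The sign convention is read off the example above: 1537 − 10342 = −8805, so the shift is taken to be BC rank minus AD rank, negative when the Ancients used a word more. Two things here are assumptions rather than a description of the actual procedure: the candidate set is the union of the two top-2000 lists, and a lemma missing from a period is ranked one place below that period's last lemma. The function names and toy counts are hypothetical.

```python
def rank_table(counts):
    """Map each lemma to its 1-based frequency rank (most frequent = 1)."""
    ordered = sorted(counts, key=lambda lemma: (-counts[lemma], lemma))
    return {lemma: rank for rank, lemma in enumerate(ordered, start=1)}


def rank_shifts(bc_counts, ad_counts, top_n=2000):
    """Return (lemma, shift) pairs sorted from most negative to most positive.

    shift = BC rank - AD rank, so negative means "more prominent BC".
    """
    bc_rank = rank_table(bc_counts)
    ad_rank = rank_table(ad_counts)
    # Assumption: an unattested lemma sits just below the last attested one.
    bc_missing = len(bc_rank) + 1
    ad_missing = len(ad_rank) + 1
    # Assumption: consider any lemma in the top_n of either period.
    candidates = {l for l, r in bc_rank.items() if r <= top_n}
    candidates |= {l for l, r in ad_rank.items() if r <= top_n}
    shifts = {
        lemma: bc_rank.get(lemma, bc_missing) - ad_rank.get(lemma, ad_missing)
        for lemma in candidates
    }
    return sorted(shifts.items(), key=lambda item: (item[1], item[0]))


# Toy counts (invented for illustration only).
bc = {"a": 10, "b": 5, "c": 1}
ad = {"c": 10, "a": 5, "d": 3}
shifts = rank_shifts(bc, ad, top_n=2)
print(shifts)
```

Under that treatment of missing lemmata, a word that never occurs in one period gets a shift about as large as that period's whole vocabulary, which would explain several words sharing nearly the same huge shift.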

These are the biggest shifts in proper names:

Ancients talked more about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| Ἔφορος | Ephorus | -8530 |
| Ποσειδώνιος | Posidonius | -8397 |
| Πελοποννήσιος | Peloponnesian | -6655 |
| Αἰτωλός | Aetolian | -5399 |
| Ἑκαταῖος | Hecataeus | -5157 |
| Θεόπομπος | Theopompus | -5046 |
| Ἀπολλόδωρος | Apollodorus | -4948 |
| Φωκεύς | Phocian | -4786 |
| Τυρρηνικός | Tyrrhenian | -4587 |
| Χρύσιππος | Chrysippus | -4043 |

Two things are going on here. First, some ancient authorities—primarily historians, if I read the names right—were of interest to several ancient writers, but of less interest to the Byzantines. They tend to be the historians whose texts didn’t survive, which is related to them being of less interest to the Byzantines. (I don’t know offhand whether that’s cause or effect.)

Second, Greece was very important to Ancient Greeks, and so were the various regions of Greece. To the Byzantines though, Greece was a backwater, and the old regions did not survive into the Byzantine system of themes. So there was no reason to talk about Aetolia or Phocis outside of Ancient History; and less reason to talk about the Peloponnese than you might think, even while the name survived. The same goes for Tyrrhenians: it wasn’t Etruscans that the Byzantines were having to deal with in Italy, but Lombards.

Ancients talked less about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| Κύριλλος | Cyril | +214,509 |
| Κωνσταντινούπολις | Constantinople | +214,399 |
| Γρηγόριος | Gregory | +214,391 |
| Ἀθανάσιος | Athanasius | +214,154 |
| Γεώργιος | George | +85,856 |
| Κωνσταντῖνος | Constantine | +47,064 |
| Πέτρος | Peter | +40,947 |
| Χριστιανός | Christian | +36,162 |
| Βασίλειος | Basil | +28,217 |
| Χριστός | Christ | +23,988 |

The only surprise is that Christians turn up in BC texts at all; there are only 5 instances though, and the dating of texts in the corpus is porous (late citations can appear as testimonia of earlier authors).

These are the biggest shifts in common nominals:

Ancients talked more about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| εὔδοξος | reputable | -8805 |
| κύλινδρος | cylinder | -6569 |
| ἀσύμμετρος | asymmetrical | -5939 |
| δημοκρατία | democracy | -5389 |
| πυραμίς | pyramid | -4714 |
| ναυμαχία | sea battle | -4274 |
| κῶνος | cone | -4205 |
| παραλληλόγραμμος | parallelogram | -3837 |
| παρεμβολή | interpolation; encampment | -3668 |
| ψήφισμα | decree passed by vote | -3194 |

If the AD texts have more theology, they clearly have a lot less geometry, and a lot less to do with representational systems of government. The drop in εὔδοξος is surprising, given it’s in Plato; I wonder if the change of -δοξ- in compounds from “reputation” to “glory” made the adjective confusing for later writers.

Ancients talked less about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| ἀποστολικός | apostolic | +214,282 |
| θεοτόκος | God-bearing (Theotokos) | +85,966 |
| βάπτισμα | baptism | +85,945 |
| θεότης | divinity | +59,016 |
| μόδιος | bushel | +58,616 |
| μοναστήριον | monastery | +57,696 |
| σεβάσμιος | reverend | +57,691 |
| αἱρετικός | heretic | +35,602 |
| χάρισμα | (spiritual) gift | +35,588 |
| πατριάρχης | patriarch | +27,001 |

No surprises again; the only non-religious term is μόδιος “bushel”, both as a vessel and a measure.

These are the biggest shifts in verbs:

Ancients talked more about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| διαπορεύω | pass across | -5939 |
| βλώσκω | go | -5710 |
| εἰσοράω | look upon | -4134 |
| ἄημι | blow (wind) | -4113 |
| κλύω | hear | -3392 |
| ἐπιζεύγνυμι | join to | -3039 |
| ἀμφισβητέω | doubt | -2436 |
| ἐφάπτω | hang on | -2122 |
| ἱκνέομαι | come | -2088 |
| μεταπέμπω | send for | -1974 |

Many of the missing verbs are poetic and/or dialectal, and would not have a natural place in Byzantine prose; that includes βλώσκω, εἰσοράω, ἄημι, κλύω, ἱκνέομαι. The surprise here is the vanishing of doubt in the Middle Ages.

… Yes, yes, the jokes just write themselves, I know…

Ancients talked less about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| ἐνάγω | persuade | +10,366 |
| βαπτίζω | baptise | +8529 |
| ψάλλω | chant | +5021 |
| φανερόω | reveal | +3988 |
| καταδικάζω | condemn | +3483 |
| φωτίζω | illuminate | +3308 |
| περισπάω | take a circumflex | +2948 |
| βαστάζω | carry | +2911 |
| ἀνέρχομαι | go up | +2809 |
| προλαμβάνω | anticipate | +2769 |

I admit to being less sure about some of the shifts here, such as ἐνάγω and προλαμβάνω. The Christian influence is clear in βαπτίζω, ψάλλω, φανερόω and φωτίζω. Language change accounts for βαστάζω and ἀνέρχομαι replacing φέρω and ἄνειμι, and I assume καταδικάζω for “condemn” replaced what came to look like more generic verbs, in καθαιρέω or καταγιγνώσκω. And unlike the Ancients, the Byzantines had to learn about polytonic orthography; so what word took a circumflex and what word took an acute was a matter much ink was spilled about.

Finally, these are the biggest shifts in function words:

Ancients talked more about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| τοτέ | at times | -5796 |
| αὖτε | again | -4707 |
| δισχίλιοι | two thousand | -4676 |
| αἴ | alas | -3334 |
| πεντακόσιοι | five hundred | -2859 |
| ἠέ | or | -2844 |
| νή | [I swear] by [deity] | -2663 |
| διακόσιοι | two hundred | -2597 |
| μά | yea | -2470 |
| πω | yet, at all | -2466 |

There is some Epic dialect here, in αὖτε and ἠέ; some strictly Attic rather than Koine words in τοτέ, πω, and δισχίλιοι; and a rather different approach to exclamations, with the old oaths by the Gods dispensed with, and the ai!’s of tragedy avoided in theological discourse. (There are 2100 instances AD of φεῦ “alas”; maybe αἴ was too specific to tragedy? *shrug*) Not sure why the written-out 500 and 200 were less popular. Maybe the armies just got bigger, so historians talked in the thousands instead of 300.

Ancients talked less about…

| Lemma | Gloss | Rank Shift |
|---|---|---|
| ἀμήν | amen | +19,195 |
| νά | to (Modern Greek) | +18,984 |
| ἀλλαχοῦ | elsewhere | +7689 |
| δηλαδή | that is | +7587 |
| ἤγουν | that is | +6367 |
| ιζʹ | XVII | +4541 |
| καθό | insofar as | +4524 |
| ιϛʹ | XVI | +4001 |
| ιηʹ | XVIII | +3727 |
| ιδʹ | XIV | +3196 |

It’s obvious why amen is there; it’s also obvious why να, the Modern Greek equivalent of the ancient infinitive inflection, is there. ἀλλαχοῦ for “elsewhere” is attested in Sophocles and Xenophon, but it became prevalent much later, and LSJ reports that Moeris proscribed it as vernacular, in favour of ἄλλοθι. The other conjunctions are run-in phrases, which Byzantine texts in general are rather more sympathetic to treating as single words than are ancient texts: δῆλα δή “so [they are] obvious”, ἤ γε οὖν “or indeed then”, καθ’ ὅ “according to what”.

Finally, the numerals aren’t there because the Byzantines were more numerate than the Ancients. After all, the Byzantines had given up on geometry, from what the counts tell us. (And that’s a silly enough thing to conclude that you should not take much of this too seriously.) No, the reason there’s a whole lot of XVII’s and XIV’s in the AD corpus is that there are a lot more chapter headings in the theologians…

6 Comments

  • Helma says:

    Thanks for that comment. Especially amusing to me since my more statistically trained, non-classicist, co-conspirators try to keep me from implementing word-specific rules:-) I should trust the system to take care of it, after all. So far, they've pretty much been right, but I'm always tempted.

  • opoudjis says:

    The distinction is more a matter of degree: I've been more conservative about displacing ambiguous parses, for the reasons you indicate; but there is a system of ranking analyses in place—especially because the TLG lemmatiser, because of the nature of its corpus, overgenerates analyses.

    The disambiguation is rule-driven rather than stochastic (since words are analysed in isolation currently); its criteria are morphological and lexical. (So δοκεύς, a hapax pretty much, won't be competing with δοκέω any more: I've added a lexical rule to that effect.) But the morphological rules are of limited scope, or else conservatively applied.

  • Helma says:

    Hi Nick,
    I guess I like cluesticks as much as anyone 🙂

    So let me leave another comment here. I think that you'll find that morphological disambiguation (less work than syntax) would not be 100% for things like ἔχις and δοκεύς (another stand-out, for obvious reasons) but quite successful nonetheless. In my corpus, after a really small amount of disambiguation, I get frequencies for ἔχω 1000 times higher than for ἔχις and few mishits, that are mostly there because the system in its first go-round had not seen a lot of postpositions yet (and because we didn't deal with caps and apostrophes well. oops!). We'll soon do a round of re-training with more data, so we'll see what happens then.
    But I understand that in what it makes available for the general user, TLG has decided to give users the option of looking for all δοκεῖ or ἔχεις instances, without taking a first stab at which is which. I guess this fits in with the general philosophy of the project and the obsessions of the average classicist — the search for completeness above all (collecting all the trees in the Greek forest). Because I'm interested in frequency distribution *and* fast searching, I've taken the other decision: go with the probables, not the possibles, for the sake of searching, lemma frequency, collocation, etc.

  • opoudjis says:

    Helma, hello, and good to hear from you again.

    Syntactic disambiguation will do some things, agreed, but I don't think it will go so far as "wonders" for the vipers. I played with syntactic disambiguation in 2004—although I should revisit it now that the lemmatiser is doing a better job: I used the collocation of parts of speech and inflectional categories for words that were unambiguous, and tried to apply those to words that were ambiguous.

    My findings at the time were, that kind of disambiguation would deal with a quarter to a third of all ambiguous word instances. Certainly a meaningful step, but it wouldn't get rid of "viper" completely. Moreover, though I went as far as a three word window either side of the ambiguous word, the only consistently reliable syntactic cue I found was preceding definite article (which cannot precede a verb). And that's because it combines a syntactic restriction with inflectional agreement.

    Otherwise, syntactic restrictions in Ancient Greek are thin on the ground: articles, prepositions, and that's about it. The word order is too free for such a restriction to emerge through pure statistics. Which means I know τῷ ἔχει is "to the viper" and not *"to the she has"; but I can't do as much with σὺ ἔχεις "thou hast". For starters, ἔχεις could be the accusative plural, and could occur in a pattern like σὺ ἔχεις ἐφόνευσας "thou murderedst vipers". For seconds, pronouns like σύ are rare in Greek in general, so they wouldn't account for a lot of instances.

    I don't want to be uncharitable to Wordle, because the "first glance" still counts for something. But yes, it's only an initial glance.

    Thank you for the cluestick on Dunning via Mueller. I'm going to blunt-instrument use the formulas as interpreted by Wordhoard, and see if the Wordles I get that way are more informative.

  • Helma says:

    You should really try a bit of disambiguation sometime (doesn't take a lot of text) — that will do wonders to your viper presence:-) Lots of (τοὺς) ἔχεις are pretty easy to distinguish from (σὺ) ἔχεις.
    But, more to the point, if you really want to compare two corpora with Wordles, check out Martin Mueller's Wordles (enter him as a Wordle user to find them). Neat stuff that contrasts Iliad and Odyssey, and similar. So his Wordles are not raw counts but differentials (more specifically, Dunning's log likelihood ratio). For English stuff, check out his Jane_Austen_avoids (used less in Austen than in her contemporary novelists). By the way, I'm completely in agreement that Wordles are fairly useless as an analytical tool after the first glance. Ordering by frequency, or alphabetically, gets you where you want to be sooner.

  • filologanoga says:

    Wow! Stuff for thought.
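For readers following the thread: Dunning's log-likelihood ratio, mentioned in the comments above as an alternative to raw counts, can be sketched as follows. This is the textbook 2×2 contingency-table form of the G² statistic, not necessarily the exact formula used by Wordhoard or in Martin Mueller's Wordles; the function name and the toy numbers are hypothetical.

```python
import math


def log_likelihood(a, b, n1, n2):
    """G2 for a lemma seen a times in a corpus of n1 tokens
    and b times in a corpus of n2 tokens.

    0 means the lemma is equally frequent (relatively) in both corpora;
    larger values mean a more marked difference in either direction.
    """
    observed = [[a, n1 - a], [b, n2 - b]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [a + b, total - (a + b)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            expected = row_totals[i] * col_totals[j] / total
            if o > 0:  # a zero cell contributes nothing (0 * log 0 := 0)
                g2 += o * math.log(o / expected)
    return 2 * g2


# Toy figures (invented): same relative frequency, then a skewed lemma.
print(log_likelihood(10, 20, 1000, 2000))
print(log_likelihood(50, 5, 1000, 1000))
```

The statistic itself is unsigned, so to say which corpus favours the lemma you still have to compare the two relative frequencies, a/n1 against b/n2.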
