Is there any NLP tool that can extract affix and stem of English words?

Post date: 2016-03-13
Posted in categories: English, Linguistics

Yes: the Porter Stemmer is by far the most popular approach. See A survey of stemming algorithms in information retrieval for an overview, the nltk.stem package for NLTK implementations, and Porter Stemming Algorithm for Porter’s own description of it. There are tweaks of it around, but no one has gone for anything substantially different; and English being the way it is, there’s no real interest in the more powerful lemmatisers, which would do actual dictionary work.

As a linguist, I am aghast (as I’m sure many other linguists are) at what the Porter Stemmer doesn’t do. For example, stupider goes to stupid, but bigger does not go to big: Porter does not touch bisyllabic words, because there’s too much risk of error. Similarly, Porter has no knowledge of, or interest in, irregular forms.
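To make the stupider/bigger contrast concrete, here is a minimal Python sketch of just one rule, as I read Porter’s published description: the suffix -er is stripped only when the remaining stem has a "measure" (roughly, a count of vowel-consonant sequences) greater than 1. This is not the full algorithm, and the function names are my own, not taken from any library.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's notion of a consonant: not a, e, i, o, u,
    and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' counts as a consonant word-initially or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions in the stem,
    i.e. the m in the pattern [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_er(word: str) -> str:
    """Apply only the '-er' rule: remove it when the stem's measure exceeds 1."""
    if word.endswith("er"):
        stem = word[:-2]
        if measure(stem) > 1:
            return stem
    return word


for w in ("stupider", "bigger"):
    print(w, "->", strip_er(w))
```

The stem stupid has measure 2, so the suffix comes off; bigg has measure 1, so bigger is left alone. The threshold is what keeps the rule from mangling short words, at the cost of missing cases like this one.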

It is a decent compromise between doing too much and doing too little (and doing too much is a real problem). What people always forget is that it has to be customised with an exceptions list, to deal with the vocabulary you’re likely to encounter. That applies in particular to its use in Lucene/Solr.
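An exceptions list can be as simple as a lookup that runs before the stemmer. The sketch below is one way to do it in Python, under assumptions of my own: `toy_stem` is a hypothetical stand-in for a real stemmer (in practice you might pass something like the `stem` method of NLTK’s `PorterStemmer`), and the entries in `EXCEPTIONS` are purely illustrative.

```python
from typing import Callable, Mapping


def make_stemmer(
    base: Callable[[str], str],
    exceptions: Mapping[str, str],
) -> Callable[[str], str]:
    """Wrap a stemming function so that listed words bypass it."""

    def stem(word: str) -> str:
        key = word.lower()
        # Exceptions win: irregular forms and protected vocabulary.
        if key in exceptions:
            return exceptions[key]
        return base(key)

    return stem


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips a plural -s and nothing else."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


# Illustrative entries only; a real list depends on your own vocabulary.
EXCEPTIONS = {
    "bigger": "big",   # a form the suffix rules leave untouched
    "better": "good",  # an irregular form
    "news": "news",    # a word to protect from over-stemming
}

stem = make_stemmer(toy_stem, EXCEPTIONS)
print([stem(w) for w in ("Bigger", "better", "news", "cats")])
```

Keeping the list as data outside the stemmer means it can grow with the corpus without anyone touching the stemming rules themselves.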
