Is there any NLP tool that can extract affix and stem of English words?

Post date: 2016-03-13
Posted in categories: English, Linguistics

Yes: the Porter Stemmer is by far the most popular approach. See A survey of stemming algorithms in information retrieval for a survey, the nltk.stem package for NLTK implementations, and Porter Stemming Algorithm for Porter’s own description of it. There are tweaks of it around, but no one has gone for anything different; and English being the way it is, there’s no real interest in the more powerful lemmatisers, which would do actual dictionary work.

As a linguist, I (and I’m sure many another linguist) am aghast at what the Porter Stemmer doesn’t do. For example, stupider goes to stupid, but bigger does not go to big: Porter only strips -er when enough of a stem is left over, so a bisyllabic word like bigger is left untouched; there’s too much risk of error. Similarly, Porter has no knowledge of, or interest in, irregular forms.
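To make the stupider/bigger contrast concrete, here is a minimal sketch of the one piece of Porter's algorithm that decides it: the "measure" of a stem (the number of vowel-consonant sequences in it) and the rule that removes -er only when that measure is greater than 1. This is not the full stemmer; every other step is left out, and the function names (`is_consonant`, `measure`, `strip_er`) are my own, not NLTK's or Porter's.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y at the start of a word, or after a vowel, counts as a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-consonant sequences, i.e. m in the form [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1  # a vowel run has just been closed off by a consonant
        prev_was_vowel = not cons
    return m


def strip_er(word: str) -> str:
    """Apply only the '(m > 1) ER ->' rule; the rest of the algorithm is omitted."""
    if word.endswith("er"):
        stem = word[:-2]
        if measure(stem) > 1:
            return stem
    return word


for w in ("stupider", "bigger"):
    print(w, "->", strip_er(w))
```

The stem stupid has measure 2, so -er comes off; the stem bigg has measure 1, so bigger is left as it is. The threshold is what keeps the rule from mangling short words, and it is also why bigger never reaches big.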

It is a decent compromise between doing too much and doing too little (and doing too much is a real problem). What people always forget is that it has to be customised with an exceptions list, to deal with the vocabulary you’re likely to encounter. That applies in particular to its use in Lucene/Solr.
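An exceptions list can be as simple as a lookup that is consulted before the stemmer runs. The sketch below assumes that shape; `with_exceptions`, `toy_stem`, and the entries in `EXCEPTIONS` are all hypothetical names and data for illustration, and `toy_stem` is a deliberately crude stand-in (it only strips a plural -s), not Porter. In practice you would pass in a real stemmer function instead.

```python
from typing import Callable, Dict


def with_exceptions(
    stem: Callable[[str], str], exceptions: Dict[str, str]
) -> Callable[[str], str]:
    """Wrap a stemmer so that listed words bypass it entirely."""

    def wrapped(word: str) -> str:
        key = word.lower()
        if key in exceptions:
            return exceptions[key]  # hand-picked stem wins
        return stem(key)

    return wrapped


def toy_stem(word: str) -> str:
    """Stand-in stemmer (an assumption for this sketch): strips a plural -s only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


# Hypothetical exceptions: a word to protect from over-stemming,
# plus irregular forms that a suffix stripper cannot relate.
EXCEPTIONS = {
    "news": "news",
    "mice": "mouse",
    "bigger": "big",
}

stem = with_exceptions(toy_stem, EXCEPTIONS)

for w in ("cats", "news", "mice", "bigger"):
    print(w, "->", stem(w))
```

The same mapping covers both failure modes: entries that map a word to itself stop the stemmer doing too much, and entries that map a word to another form make up for it doing too little.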
