Is there any NLP tool that can extract affix and stem of English words?
Yes, the Porter Stemmer is the most popular approach by far. See “A survey of stemming algorithms in information retrieval” for an overview, the nltk.stem package for NLTK implementations, and “Porter Stemming Algorithm” for Porter’s own description of it. There are tweaks of it around, but no one has gone for anything different; and English being the way it is, there’s no real interest in the more powerful lemmatisers, which would do actual dictionary work.
As a linguist, I (and I’m sure many another linguist) am aghast at what the Porter Stemmer doesn’t do. For example, stupider goes to stupid, but bigger does not go to big: Porter will not strip the suffix when what is left behind would be too short, because there’s too much risk of error. Similarly, Porter has no idea of, or interest in, irregular forms.
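To make the stupider/bigger contrast concrete, here is a minimal sketch of the one piece of Porter’s algorithm that is responsible for it: the “measure” of a stem (roughly, how many vowel-then-consonant runs it contains) and the rule that strips -er only when that measure is greater than 1. This is not the full stemmer; the function names are my own, and every other step of the algorithm is left out.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's notion of a consonant: not a, e, i, o, u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' counts as a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


def strip_er(word: str) -> str:
    """Apply only the '-er' rule: remove the suffix if the remaining stem has measure > 1."""
    if word.endswith("er"):
        stem = word[:-2]
        if measure(stem) > 1:
            return stem
    return word


for w in ("stupider", "bigger"):
    print(w, "->", strip_er(w))
```

The stem stupid has measure 2, so the suffix goes; bigg has measure 1, so bigger is left untouched.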
It is a decent compromise between doing too much and doing too little (and doing too much is a real problem). What people always forget is that it has to be customised with an exceptions list, to deal with the vocabulary you’re likely to encounter. That applies in particular to its use in Lucene/Solr.
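One way to picture the exceptions list is as a lookup that runs before the stemmer does. The sketch below is only an illustration: make_stemmer, the EXCEPTIONS entries, and naive_stem are all hypothetical, and naive_stem is a deliberately crude stand-in rather than Porter. In practice you would pass in a real stemming function, such as the stem method of NLTK’s PorterStemmer.

```python
from typing import Callable, Mapping


def make_stemmer(base_stem: Callable[[str], str],
                 exceptions: Mapping[str, str]) -> Callable[[str], str]:
    """Wrap a stemming function so that listed words bypass it entirely."""
    def stem(word: str) -> str:
        key = word.lower()
        if key in exceptions:
            # Domain-specific or irregular form: use the hand-written answer.
            return exceptions[key]
        return base_stem(key)
    return stem


# Hypothetical exceptions; a real list depends on your own vocabulary.
EXCEPTIONS = {"bigger": "big", "went": "go", "mice": "mouse"}


def naive_stem(word: str) -> str:
    """Stand-in for a real stemmer: strips a plural -s and nothing else."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


stem = make_stemmer(naive_stem, EXCEPTIONS)

for w in ("Bigger", "went", "cats", "glass"):
    print(w, "->", stem(w))
```

Keeping the exceptions outside the stemmer means the list can grow with your data without touching the algorithm itself.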