Stop words are words that are generally considered to add no information to the question at hand. No universal list of stop words exists, since what counts as uninformative depends on the context of your application. What I label below as "Generic" stop words are words such as "and", "the", or "of". The generic stop word list I provide is based on the stop word list used by Python's Natural Language Toolkit (NLTK), modified as follows:
- All one-letter words ("A", "I", "S", "T") - When parsing business documents, we have generally found it best to simply ignore one-letter words. "A" and "I" are not critical to word counts and can be miscounted because they are frequently used to itemize lists. As with all word lists, your application might require modifications. (For example, if you're looking for CEO self-attribution, "I" becomes a critical word.)
- "DON" - more likely to be a name.
- "WILL" - a component of our "strong modal" sentiment list
- "AGAINST" - a component of our "negative" sentiment list
- Added: "AMONG"
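The modifications above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base set below is a tiny hypothetical sample rather than the actual NLTK list, the first four items are treated as removals (which the text implies but does not state outright), and the function name `build_generic` is invented for this example.

```python
# Hypothetical miniature base list; NOT the real NLTK stop word list.
base_stop_words = {"a", "i", "s", "t", "and", "the", "of",
                   "don", "will", "against"}


def build_generic(base):
    """Apply the modifications described above to a base stop word set."""
    # The provided lists are uppercase, so normalize first.
    words = {w.upper() for w in base}
    # Assumption: one-letter words are dropped from the list.
    words = {w for w in words if len(w) > 1}
    # Assumption: these three are removed (name / sentiment-list words).
    words -= {"DON", "WILL", "AGAINST"}
    # "AMONG" is added.
    words.add("AMONG")
    return words


generic = build_generic(base_stop_words)
print(sorted(generic))
```

If your application needs "I" (for example, for CEO self-attribution), you would skip the one-letter filter or simply leave "I" out of your own list.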
Note that all of the lists are in uppercase, so convert your own tokens to uppercase (or the list to lowercase) before making equivalence comparisons. We also add comments to some of the words using a pipe (|) delimiter, so when reading the lists, split each line at the pipe and treat only the first field as the word.
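A minimal sketch of reading such a list and using it for case-insensitive filtering follows. It assumes one word per line with any comment following the pipe; the sample lines, the comment text, and the function names `parse_stop_word_lines` and `remove_stop_words` are hypothetical and chosen only for illustration.

```python
def parse_stop_word_lines(lines):
    """Return a set of uppercase stop words from pipe-commented lines."""
    words = set()
    for line in lines:
        # Keep only the text before the first pipe; the rest is a comment.
        word = line.split("|", 1)[0].strip().upper()
        if word:  # skip blank lines
            words.add(word)
    return words


def remove_stop_words(text, stop_words):
    """Drop tokens whose uppercase form appears in the stop word set."""
    return [tok for tok in text.split() if tok.upper() not in stop_words]


# Hypothetical sample lines standing in for a stop word file.
sample_lines = ["AND", "THE | hypothetical comment", "OF", ""]
stop_words = parse_stop_word_lines(sample_lines)

tokens = remove_stop_words("The cost of goods and services", stop_words)
print(tokens)
```

Because an open text file iterates line by line, the same parser should also accept a file object in place of `sample_lines`. Whitespace tokenization is used here only to keep the example short; real documents usually need punctuation handling as well.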
As discussed before, there are many stop word lists available on the internet. In most cases, we recommend using our "Generic" list. For completeness, we also provide a longer list labeled "GenericLong". The other lists are context-specific and not necessarily exhaustive.