This page contains tools that are useful for textual analysis in financial applications and data from some of the textual-related publications I have with Tim Loughran. The essential method of textual analysis goes by various labels in other disciplines such as content analysis, natural language processing, information retrieval, or computational linguistics. A growing literature finds significant relations between stock price reactions and the sentiment of information releases as measured by word classifications such as those provided below.
If you would like to receive e-mail notifications of updates, please send me an e-mail and I’ll put you on the update listserv. The data compilations provided on this website are for use by individual researchers. For commercial licenses, please contact us.
Loughran-McDonald Sentiment Word Lists
Loughran-McDonald Master Dictionary
A tabulation of all SEC EDGAR Filings by Type and Year
Note: We thank Cam Harvey and others who suggested some of the modifications we’ve included in these lists. The word lists are described in:
- Tim Loughran and Bill McDonald, 2011, “When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks,” Journal of Finance, 66:1, 35-65. (Available at SSRN: http://ssrn.com/abstract=1331573.)
- Andriy Bodnaruk, Tim Loughran and Bill McDonald, 2015, “Using 10-K Text to Gauge Financial Constraints,” Journal of Financial and Quantitative Analysis, 50:4, August 2015, 1-24. (Available at SSRN: http://ssrn.com/abstract=2331544.)
- Tim Loughran and Bill McDonald, 2016, “Textual Analysis in Accounting and Finance: A Survey,” Journal of Accounting Research, 54:4, September 2016, 1187-1230. (Available at SSRN: http://ssrn.com/abstract=2504147.)
Although we provide each sentiment category below, all of the word lists are also contained in the Master Dictionary.
For WordStat users: WordStat .cat and .NFO files
Updated: March 2015
Derived from release 4.0 of 2of12inf. Extended to include words appearing in 10-K documents that are not found in the original 2of12inf word list. In addition to providing a master word list, the dictionary includes statistics for word frequencies in all 10-Ks from 1994-2014 (including 10-X variants). The dictionary reports counts, proportion of total, average proportion per document, standard deviation of proportion per document, document count (i.e., number of documents containing at least one occurrence of the word), eight sentiment category identifiers, Harvard Word List identifier, number of syllables, and source for each word. The sentiment categories are: negative, positive, uncertainty, litigious, modal, constraining. Modal words are flagged as 1, 2 or 3, with 1 = Strong Modal, 2 = Moderate Modal, and 3 = Weak Modal. The other sentiment words are flagged with a number indicating the year in which they were added to the list. Detailed documentation appears here.
As noted above, the Master Dictionary also tabulates all of the sentiment word lists. Each row in the Master Dictionary spreadsheet is a word. Sentiment word lists are identified by column, with members of the given set identified by non-zero entries. The non-zero entries represent the year in which the word was added to a given sentiment list.
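As a minimal sketch of how the sentiment lists might be extracted from the Master Dictionary, the example below assumes the spreadsheet has been exported to CSV with a `Word` column and one column per sentiment category. The column names, the function name `load_sentiment_lists`, and the in-memory sample rows (including the years shown) are illustrative assumptions, not the actual file layout; consult the dictionary documentation for the real headers.

```python
import csv
import io

def load_sentiment_lists(csv_file, categories):
    """Return {category: set of words} for rows with a non-zero entry."""
    lists = {cat: set() for cat in categories}
    for row in csv.DictReader(csv_file):
        word = row["Word"].strip().upper()
        for cat in categories:
            # A non-zero entry marks membership (the year the word was added).
            value = (row.get(cat) or "0").strip()
            if value not in ("", "0"):
                lists[cat].add(word)
    return lists

# Hypothetical sample rows; the years are placeholders, not real data.
sample = io.StringIO(
    "Word,Negative,Positive,Uncertainty\n"
    "LOSS,2009,0,0\n"
    "ACHIEVE,0,2009,0\n"
    "APPROXIMATELY,0,0,2009\n"
    "TABLE,0,0,0\n"
)
sentiment = load_sentiment_lists(sample, ["Negative", "Positive", "Uncertainty"])
```

Words with zeros in every sentiment column (here, `TABLE`) appear in the master word list but in none of the returned sets.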
1993-2016 SEC Filings by Type/Year: Master Index Analysis
Updated: January 2017
The SEC's EDGAR website indexes all electronic filings in quarterly master files. This spreadsheet tabulates all of the various filings by year.
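To illustrate the kind of tabulation described here, the sketch below counts filings by year and form type from the text of a single quarterly master index. It assumes the pipe-delimited layout `CIK|Company Name|Form Type|Date Filed|Filename` with the year as the first four characters of the filing date. The function name `tabulate_filings` is hypothetical and the sample lines are made up; this is not the code used to build the spreadsheet.

```python
from collections import Counter

def tabulate_filings(index_text):
    """Count filings by (year, form type) from master-index text."""
    counts = Counter()
    for line in index_text.splitlines():
        fields = line.split("|")
        # Skip header, separator, and blank lines: data rows have five
        # fields and begin with a numeric CIK.
        if len(fields) != 5 or not fields[0].strip().isdigit():
            continue
        form_type = fields[2].strip()
        year = fields[3].strip()[:4]
        counts[(year, form_type)] += 1
    return counts

# Made-up index lines for illustration only.
sample_index = """CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------
1000001|EXAMPLE CORP|10-K|2016-02-26|edgar/data/1000001/example-1.txt
1000002|SAMPLE INC|10-K|2016-03-01|edgar/data/1000002/example-2.txt
1000002|SAMPLE INC|8-K|2016-03-15|edgar/data/1000002/example-3.txt
"""
filing_counts = tabulate_filings(sample_index)
```

Summing such counters across all quarterly files would give a type-by-year table of the sort the spreadsheet reports.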
Stop words are generally words that are not considered to add information content to the question at hand. Thus, no universal list of stop words exists since what is considered uninformative depends on the context of your application. What I label below as "Generic" stop words are words such as "and", "the", or "of". The generic stop word list I provide is based on the stop word list used by Python's Natural Language Toolkit (NLTK), modified as follows (the first four items were removed from the NLTK list; the last was added):
- All one-letter words ("A", "I", "S", "T") - Generally we've found it best in parsing business documents to simply ignore one-letter words. "A" and "I" are not critical to count as words and can be miscounted since they are frequently used to itemize lists. As with all word lists, your application might require modification. (For example, if you're looking for CEO self-attribution, "I" becomes a critical word.)
- "DON" - more likely to be a name.
- "WILL" - a component of our "strong modal" sentiment list
- "AGAINST" - a component of our "negative" sentiment list
- Added: "AMONG"
Note that all of the lists are in uppercase, so you must remember to make equivalence comparisons with this in mind. We also add comments to some of the words using a pipe (|) delimiter, so the lists should be read accordingly.
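Reading a list along these lines might look like the following sketch. It assumes one word per line with an optional comment after the pipe; the helper names `read_word_list` and `remove_stop_words`, the sample entries, and the comment text are illustrative assumptions rather than the actual file contents.

```python
def read_word_list(lines):
    """Return the set of uppercase words, dropping any '|' comment."""
    words = set()
    for line in lines:
        word = line.split("|", 1)[0].strip().upper()
        if word:
            words.add(word)
    return words

def remove_stop_words(tokens, stop_words):
    """Compare in uppercase so 'The' and 'THE' match the same entry."""
    return [t for t in tokens if t.upper() not in stop_words]

# Illustrative entries; the comment text after the pipe is made up.
sample_lines = ["AND", "THE", "OF", "AMONG | added to the NLTK-based list"]
stop_words = read_word_list(sample_lines)
kept = remove_stop_words("The cost of revenue increased".split(), stop_words)
```

Uppercasing each token at comparison time leaves the original text untouched, which is convenient if the surviving tokens are later matched against the (also uppercase) sentiment lists.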
As noted above, no single stop word list is universal, and many are available on the internet. In most cases we would recommend using our "Generic" list. For completeness, we also provide a longer list labeled "GenericLong". The other lists are context specific and not necessarily exhaustive.