This page contains tools useful for textual analysis in financial applications and data from some of the textual-related publications I have with Tim Loughran. The essential method of textual analysis goes by various labels in other disciplines such as content analysis, natural language processing, information retrieval, or computational linguistics. A growing literature finds significant relations between financial phenomena (e.g., stock returns, commodity prices, bankruptcies, governance) and the sentiment of financial disclosures as measured by word classifications such as those provided below.
If you would like to receive e-mail notifications of updates please send me an e-mail and I’ll put you on the update listserv. The data compilations provided on this website are for use by individual researchers. For commercial licenses, please contact us.
- Loughran-McDonald Master Dictionary - Dictionary used to determine which tokens (collections of characters) are classified as words. Also includes sentiment word classifications. (v. 2020)
- A tabulation of all SEC EDGAR Filings by Type and Year - simple counts of all EDGAR filing types by year. (v. 2020)
- Stopwords - various stopword lists. (v. 2020)
Updated: July 2020
- Derived from release 4.0 of 2of12inf. This is a fairly common baseline dictionary and is oriented towards common words. The 2of12inf dictionary contains word inflections but does not contain abbreviations, acronyms, British English, hyphenated words, names, or phrases.
- We extend the 2of12inf baseline dictionary to include words that appear in 10-K documents and earnings calls but are not found in the original 2of12inf word list. To identify them, we examine tokens from all 10-K type filings in the full EDGAR 10-K archive and from earnings calls from CapIQ. The words we have added to the original 2of12inf dictionary are either inflections of more commonly appearing words or words that appear in more than 5% of the documents.
- The dictionary reports counts, proportion of total, average proportion per document, standard deviation of proportion per document, document count (i.e., number of documents containing at least one occurrence of the word), eight sentiment category identifiers, number of syllables, and source for each word (source is either 2of12inf or the year in which the word was added).
- The sentiment categories are: negative, positive, uncertainty, litigious, strong modal, weak modal, constraining, and complexity. Each sentiment word is flagged with a number indicating the year in which it was added to the list; a year preceded by a negative sign indicates the year/version in which the word was removed from the sentiment category.
- Although the dictionary does not, in general, include abbreviations, in the current revision we have added a limited number of abbreviations (20) that commonly occur in the periodic filings.
- From the parsing of the most recent years, we added 25 new words to the dictionary in the 2020 version.
- Detailed documentation appears here.
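As a minimal sketch of how the year-flag convention for the sentiment categories might be read, the following Python parses a small made-up sample in CSV form. The column names (`Word`, `Negative`, `Positive`, `Source`), the use of zero for "not in this category", and the sample rows (including the hypothetical `EXAMPLEWORD`) are assumptions for illustration only; check the header of the actual dictionary file and the detailed documentation before relying on them.

```python
import csv
import io

# Hypothetical sample rows; the real file has more columns and other values.
SAMPLE = """Word,Negative,Positive,Source
ABANDON,2009,0,2of12inf
ACHIEVE,0,2009,2of12inf
EXAMPLEWORD,-2020,0,2of12inf
THE,0,0,2of12inf
"""


def load_category(handle, category):
    """Return the set of words currently flagged in `category`.

    Assumed convention: a positive year means the word was added in that
    year, a negative year means it was removed, and zero means the word
    was never in the category.
    """
    words = set()
    for row in csv.DictReader(handle):
        if int(row[category]) > 0:
            words.add(row["Word"].upper())
    return words


negative = load_category(io.StringIO(SAMPLE), "Negative")
positive = load_category(io.StringIO(SAMPLE), "Positive")
print(sorted(negative))
print(sorted(positive))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle. Words with a negative flag are skipped here, which reproduces the current version of a list; keeping them instead would recover an earlier version.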
Sentiment Word Lists
As noted above, the Master Dictionary also tabulates all of the sentiment word lists. For commercial licenses, please contact us.
The word lists are described in:
- Tim Loughran and Bill McDonald, 2011, When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks, Journal of Finance, 66:1, 35-65. (Available at SSRN: http://ssrn.com/abstract=1331573.)
- Andriy Bodnaruk, Tim Loughran and Bill McDonald, 2015, Using 10-K Text to Gauge Financial Constraints, Journal of Financial and Quantitative Analysis, 50:4, 1-24. (Available at SSRN: http://ssrn.com/abstract=2331544.)
- Tim Loughran and Bill McDonald, 2016, Textual Analysis in Accounting and Finance: A Survey, Journal of Accounting Research, 54:4, 1187-1230. (Available at SSRN: http://ssrn.com/abstract=2504147.)
- Tim Loughran and Bill McDonald, 2021, Measuring Complexity, SSRN: https://ssrn.com//abstract=3645372
We thank Cam Harvey and others who have suggested some of the modifications and updates we’ve included in these lists.
For WordStat users: WordStat .cat and .NFO files (2018 version)
1993-2020 SEC Filings by Type/Year: Master Index Analysis (click to download)
Updated: July 2020
The SEC's EDGAR website indexes all electronic filings in quarterly master files. This spreadsheet tabulates all of the various filings by year.
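A tabulation of this kind could be sketched roughly as follows. The example assumes each data line of a quarterly master index is pipe-delimited with the fields CIK, company name, form type, date filed, and filename, and that the filing date begins with a four-digit year. The sample lines, company names, and paths are made up for illustration, and this is not necessarily how the spreadsheet itself was produced.

```python
from collections import Counter

# Made-up lines in the assumed layout:
# CIK|Company Name|Form Type|Date Filed|Filename
SAMPLE_LINES = [
    "CIK|Company Name|Form Type|Date Filed|Filename",
    "--------------------------------------------",
    "1000001|EXAMPLE CORP|10-K|2019-03-01|edgar/data/1000001/example1.txt",
    "1000002|SAMPLE INC|8-K|2019-05-10|edgar/data/1000002/example2.txt",
    "1000001|EXAMPLE CORP|10-K|2020-02-28|edgar/data/1000001/example3.txt",
    "1000003|DEMO LLC|10-Q|2020-08-05|edgar/data/1000003/example4.txt",
]


def count_filings(lines):
    """Count filings by (form type, year) from master-index style lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Skip header text, separator lines, and anything that does not
        # look like a five-field data record starting with a numeric CIK.
        if len(fields) != 5 or not fields[0].strip().isdigit():
            continue
        form_type = fields[2].strip()
        year = int(fields[3].strip()[:4])
        counts[(form_type, year)] += 1
    return counts


counts = count_filings(SAMPLE_LINES)
for (form_type, year), n in sorted(counts.items()):
    print(form_type, year, n)
```

Keying the counter on the pair (form type, year) keeps the result easy to pivot into a type-by-year table afterwards.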
Stop words are generally words that are not considered to add information content to the question at hand. Thus, no universal list of stop words exists since what is considered uninformative depends on the context of your application. What I label below as "Generic" stop words are words such as "and", "the", or "of". The generic stop word list I provide is based on the stop word list used by Python's Natural Language Toolkit (NLTK), modified as follows:
- Removed: all one-letter words ("A", "I", "S", "T"). Generally we've found it best in parsing business documents to simply ignore one-letter words. "A" and "I" are not critical in counting as words and can be miscounted since they are frequently used to itemize lists. As with all word lists, your application might require modification. (For example, if you're looking for CEO self-attribution, "I" becomes a critical word.)
- Removed: "DON", which is more likely to be a name.
- Removed: "WILL", a component of our "strong modal" sentiment list.
- Removed: "AGAINST", a component of our "negative" sentiment list.
- Added: "AMONG"
Note that all of the lists are in uppercase, so you must remember to make equivalence comparisons with this in mind. We also add comments to some of the words using a pipe (|) delimiter, so the lists should be read accordingly.
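A minimal sketch of reading such a list, assuming one word per line with an optional comment after the pipe, might look like this. The sample contents and the function names `load_stopwords` and `remove_stopwords` are hypothetical; the actual files may differ in detail.

```python
import io

# Hypothetical file contents: one uppercase word per line,
# with an optional comment after a pipe (|) delimiter.
SAMPLE = """AND
THE
OF | example comment
AMONG
"""


def load_stopwords(handle):
    """Read stop words, dropping any comment that follows a pipe."""
    stopwords = set()
    for line in handle:
        word = line.split("|", 1)[0].strip().upper()
        if word:
            stopwords.add(word)
    return stopwords


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose uppercase form appears in the stop word set."""
    return [t for t in tokens if t.upper() not in stopwords]


stopwords = load_stopwords(io.StringIO(SAMPLE))
tokens = "The cost of revenue among the segments".split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Uppercasing each token at comparison time leaves the original text untouched while still matching the uppercase lists.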
As noted above, no single stopword list suits every application, and many lists are available on the internet. In most cases, we would recommend using our "generic" list. For completeness, we also provide a longer list labeled "GenericLong". The other lists are context-specific and not necessarily exhaustive.