This page contains tools useful for textual analysis in financial applications and data from some of the textual-related publications I have with Tim Loughran. The essential method of textual analysis goes by various labels in other disciplines such as content analysis, natural language processing, information retrieval, or computational linguistics. A growing literature finds significant relations between stock price reactions and the sentiment of information releases as measured by word classifications such as those provided below.
If you would like to receive e-mail notifications of updates please send me an e-mail and I’ll put you on the update listserv. The data compilations provided on this website are for use by individual researchers. For commercial licenses please contact us.
- Loughran-McDonald Sentiment Word Lists - text files containing each of the LM sentiment words by category (Negative, Positive, Uncertainty, Litigious, Modal, Constraining)
- Loughran-McDonald Master Dictionary - Dictionary used to determine which tokens (collections of characters) are classified as words. Also contains sentiment classifications, counts across all filings, and other useful information about each word.
- Loughran-McDonald 10X File summaries - A file containing sentiment counts, file size and other measures for all 10-X filings for all years.
- A tabulation of all SEC EDGAR Filings by Type and Year - simple counts of all EDGAR filing types by year.
- Stopwords - various stopword lists.
Note: We thank Cam Harvey and others who suggested some of the modifications we’ve included in these lists. The word lists are described in:
- Tim Loughran and Bill McDonald, 2011, When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks, Journal of Finance, 66:1, 35-65. (Available at SSRN: http://ssrn.com/abstract=1331573.)
- Andriy Bodnaruk, Tim Loughran and Bill McDonald, 2015, Using 10-K Text to Gauge Financial Constraints, Journal of Financial and Quantitative Analysis, 50:4, 1-24. (Available at SSRN: http://ssrn.com/abstract=2331544.)
- Tim Loughran and Bill McDonald, 2016, Textual Analysis in Accounting and Finance: A Survey, Journal of Accounting Research, 54:4, 1187-1230. (Available at SSRN: http://ssrn.com/abstract=2504147.)
Although we provide each sentiment category below, all of the word lists are also contained in the Master Dictionary.
For WordStat users: WordStat .cat and .NFO files
Updated: June 2017
Derived from release 4.0 of 2of12inf. Extended to include words appearing in 10-K documents that are not found in the original 2of12inf word list. In addition to providing a master word list, the dictionary includes statistics for word frequencies in all 10-Ks from 1994-2016 (including 10-X variants - see detailed documentation). The dictionary reports counts, proportion of total, average proportion per document, standard deviation of proportion per document, document count (i.e., number of documents containing at least one occurrence of the word), eight sentiment category identifiers, Harvard Word List identifier, number of syllables, and source for each word. The sentiment categories are: negative, positive, uncertainty, litigious, modal, constraining. Modal words are flagged as 1, 2 or 3, with 1 = Strong Modal, 2 = Moderate Modal, and 3 = Weak Modal. The other sentiment words are flagged with a number indicating the year in which they were added to the list. Detailed documentation appears here.
As noted above, the Master Dictionary also tabulates all of the sentiment word lists. Each row in the Master Dictionary spreadsheet is a word. Sentiment word lists are identified by column, with members of the given set identified by non-zero entries. The non-zero entries represent the year in which the word was added to a given sentiment list.
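The column-based layout described above can be read with a few lines of Python. This is a minimal sketch, not our production code: the sample rows, the column headers ("Word", "Negative", and so on), and the function name load_sentiment_lists are assumptions for illustration, so check them against the header record of the file you actually download.

```python
import csv
import io

# Hypothetical excerpt; the real file has more columns and other values.
SAMPLE = """Word,Negative,Positive,Uncertainty
ABANDON,2009,0,0
ACHIEVE,0,2009,0
APPROXIMATE,0,0,2009
BUILDING,0,0,0
"""


def load_sentiment_lists(handle, categories, word_column="Word"):
    """Map each sentiment column to the set of words with a non-zero entry."""
    lists = {name: set() for name in categories}
    for row in csv.DictReader(handle):
        for name in categories:
            # A non-zero entry marks membership in that sentiment list.
            if row[name].strip() not in ("", "0"):
                lists[name].add(row[word_column].upper())
    return lists


sentiment = load_sentiment_lists(
    io.StringIO(SAMPLE), ["Negative", "Positive", "Uncertainty"]
)
print(sorted(sentiment["Negative"]))
```

Passing the category names explicitly keeps the count and proportion columns from being mistaken for sentiment flags.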
Download file here (187MB)
Using the Stage One parsed files (here), a dataset is created containing summary data for each filing. A Python class for this module is available here. This file contains a header record with labels and is comma delimited. Each record reports:
- CIK – the SEC Central Index Key
- FILING_DATE – the filing date (YYYYMMDD) for the form
- FYE – fiscal-year-end as reported in the filing.
- FORM_TYPE – the specific form type (e.g., 10-K, 10-K/A, 10-Q405, etc.)
- FILE_NAME – the local file name for the filing
- SIC – the four-digit SIC reported in the header of the filing. If this number does not appear in the header, then the primary web page for all filings from that firm at EDGAR is parsed in an attempt to identify the SIC number. If all of these methods fail, an SIC of -99 is assigned.
- FFInd – the Fama-French 48 industry classification based on the SIC number. All missing SICs are assigned to the miscellaneous category.
- N_Words – the count of all words, where a word is any token appearing in the Master Dictionary.
- N_Unique – the number of words occurring at least once in the document.
- A sequence of sentiment counts – negative, positive, uncertainty, litigious, weak modal, moderate modal, strong modal, constraining.
- N_Negation – a count of cases where negation occurs within four or fewer words from a word identified as positive. Negation words are “no”, “not”, “none”, “neither”, “never”, and “nobody” (see Gunnel Tottie, 1991, Negation in Speech and Writing). Thus the net positive word count is the positive word count minus N_Negation. Although the technique seems reasonable, most important cases of negation are sufficiently subtle that most algorithms will not pick them up.
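The N_Negation adjustment can be sketched as follows. This is an illustration under stated assumptions rather than the code used to build the file: it assumes the negation word must fall within the four tokens preceding the positive word, it uses a simple alphabetic tokenizer, and the two-word positive list and the function name count_negated_positives are hypothetical stand-ins.

```python
import re

NEGATION_WORDS = {"NO", "NOT", "NONE", "NEITHER", "NEVER", "NOBODY"}


def count_negated_positives(text, positive_words, window=4):
    """Return (positive count, negated positive count) for a text."""
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    positive_count = 0
    negated_count = 0
    for i, token in enumerate(tokens):
        if token not in positive_words:
            continue
        positive_count += 1
        # Assumption: only the `window` tokens before the word are checked.
        preceding = tokens[max(0, i - window):i]
        if any(word in NEGATION_WORDS for word in preceding):
            negated_count += 1
    return positive_count, negated_count


positive = {"PROFITABLE", "STRONG"}  # hypothetical stand-in for the LM list
text = "Results were strong, but the new segment was not yet profitable."
pos, neg = count_negated_positives(text, positive)
net_positive = pos - neg
print(pos, neg, net_positive)
```

Here "strong" counts as positive while "not yet profitable" is treated as negated, so the net positive count is one.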
Statistics derived from the Stage One Parse
- GrossFileSize – the total number of characters in the original filing.
- NetFileSize – the total number of characters in the filing after the Stage One Parse.
- ASCIIEncodedChars – the total number of ASCII Encoded characters (e.g., &amp;)
- HTMLChars – the total number of characters attributable to HTML encoding.
- XBRLChars – the total number of characters attributable to XBRL encoding.
- XMLChars – the total number of characters attributable to XML encoding.
- N_Tables – number of tables in the filing.
- N_Exhibits – number of exhibits in the filing.
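Because the summary file is comma delimited with a header record, it can be read directly with Python's csv module. The sketch below is only an illustration: the sample rows are made up, the sentiment column label "N_Negative" is an assumption (check the actual header record), and read_summaries is a hypothetical helper name.

```python
import csv
import io
from datetime import datetime

# Made-up rows using a subset of the fields listed above.
SAMPLE = """CIK,FILING_DATE,FORM_TYPE,SIC,N_Words,N_Negative,GrossFileSize,NetFileSize
1234,20150227,10-K,3571,40000,600,2000000,300000
5678,20150331,10-K/A,-99,1000,5,50000,8000
"""


def read_summaries(handle):
    """Parse summary records into dictionaries with typed values."""
    records = []
    for row in csv.DictReader(handle):
        sic = int(row["SIC"])
        n_words = int(row["N_Words"])
        gross = int(row["GrossFileSize"])
        records.append({
            "cik": int(row["CIK"]),
            # FILING_DATE is stored as YYYYMMDD.
            "filing_date": datetime.strptime(row["FILING_DATE"], "%Y%m%d").date(),
            "form_type": row["FORM_TYPE"],
            # An SIC of -99 means no SIC could be identified.
            "sic": None if sic == -99 else sic,
            "pct_negative": int(row["N_Negative"]) / n_words if n_words else None,
            "net_to_gross": int(row["NetFileSize"]) / gross if gross else None,
        })
    return records


records = read_summaries(io.StringIO(SAMPLE))
print(records[0]["filing_date"], records[1]["sic"])
```

Scaling the sentiment counts by N_Words, as in pct_negative, is one common normalization; your application may call for a different one.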
1993-2016 SEC Filings by Type/Year: Master Index Analysis (click to download)
Updated: January 2017
The SEC's EDGAR website indexes all electronic filings in quarterly master files. This spreadsheet tabulates all of the various filings by year.
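A tabulation of this kind can be sketched in a few lines. The example assumes the pipe-delimited layout of the quarterly master index (CIK, company name, form type, date filed, file name), skips the descriptive preamble, and takes the year from the first four characters of the date. The sample records and the function name tabulate_filings are made up for illustration, and this is not the code used to produce the spreadsheet.

```python
from collections import Counter

# Made-up records in the assumed master index layout.
SAMPLE_INDEX = """Description: Master Index of EDGAR Dissemination Feed

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000045|EXAMPLE CORP A|10-K|2016-03-01|edgar/data/1000045/example-a.txt
1000046|EXAMPLE CORP B|10-Q|2016-05-10|edgar/data/1000046/example-b.txt
1000047|EXAMPLE CORP C|10-K|2016-03-15|edgar/data/1000047/example-c.txt
"""


def tabulate_filings(lines):
    """Count filings by (form type, year), ignoring non-record lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Records have five fields and start with a numeric CIK.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        form_type, date_filed = fields[2], fields[3]
        counts[(form_type, date_filed[:4])] += 1
    return counts


counts = tabulate_filings(SAMPLE_INDEX.splitlines())
for (form_type, year), n in sorted(counts.items()):
    print(form_type, year, n)
```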
Stop words are generally words that are not considered to add information content to the question at hand. Thus, no universal list of stop words exists since what is considered uninformative depends on the context of your application. What I label below as "Generic" stop words are words such as "and", "the", or "of". The generic stop word list I provide is based on the stop word list used by Python's Natural Language Toolkit (NLTK), modified as follows:
- All one-letter words ("A", "I", "S", "T") - Generally we've found it best in parsing business documents to simply ignore one-letter words. "A" and "I" are not critical to count as words and can be miscounted since they are frequently used to itemize lists. As with all word lists, your application might require modification. (For example, if you're looking for CEO self-attribution, "I" becomes a critical word.)
- "DON" - more likely to be a name.
- "WILL" - a component of our "strong modal" sentiment list
- "AGAINST" - a component of our "negative" sentiment list
- Added: "AMONG"
Note that all of the lists are in uppercase, so you must remember to make equivalence comparisons with this in mind. We also add comments to some of the words using a pipe (|) delimiter, so the lists should be read accordingly.
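Both points, the uppercase convention and the pipe-delimited comments, can be handled when the list is loaded. The following is a minimal sketch: the sample lines, the comment text, and the function name load_stopwords are hypothetical, and with a real file you would pass the open file handle instead of a list of strings.

```python
def load_stopwords(lines):
    """Return the set of uppercase stop words, dropping pipe comments."""
    stopwords = set()
    for line in lines:
        # Keep only the text before the first pipe, if any.
        word = line.split("|", 1)[0].strip().upper()
        if word:
            stopwords.add(word)
    return stopwords


# Hypothetical lines in the format described above.
SAMPLE = ["ABOUT", "AMONG | added to the NLTK-based list", "", "THE"]
stopwords = load_stopwords(SAMPLE)

# Uppercase the document tokens before comparing them with the list.
tokens = "The spread among the firms".upper().split()
kept = [token for token in tokens if token not in stopwords]
print(kept)
```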
As discussed before, there are many stopword lists available on the internet. In most cases we would recommend using our "generic" list. For completeness, we also provide a longer list labeled "GenericLong". The other lists are context specific and not necessarily exhaustive.