Stage One 10-X Parse Data
** Download the zipped data files by year here. **
Documentation for Stage One 10-X Parse
1. Author: Professor Bill McDonald
Mendoza College of Business
University of Notre Dame
Notre Dame, IN 46556
2. Stage One Parse
“10-X” represents any Securities and Exchange Commission (SEC) filing that is a 10-K variant, e.g., 10-Q, 10-K/A, or 10-K405. These annual and quarterly filings are required of any issuer with securities registered under Section 12 of, or subject to Section 15(d) of, the Securities Exchange Act of 1934, as amended, and subject to the periodic and current reporting requirements of Section 13 or 15(d).
I provide two primary data sources associated with 10-X filings on the Securities and Exchange Commission’s (SEC) EDGAR website. The first is the “Stage One Parse”, which cleans each filing document of extraneous materials and is described in detail below. Separately, I provide data associated with a “Stage Two Parse”, which parses each document into tokens and tabulates various word counts. All of these data are available at https://sraf.nd.edu/data/.
A substantial portion of an EDGAR text filing’s content consists of HTML code, embedded PDFs, JPGs, and other artifacts not typically of interest. The complete file size for some of the largest filings exceeds 400MB. To the extent your research does not require these artifacts, the parsing process can be made orders of magnitude more efficient by removing these items and creating compressed versions of the filings. For example, after our Stage One Parse, the largest file is less than 5KB. These considerations are most relevant for the annual and quarterly filings of firms (annual and quarterly reports pursuant to Section 13 or 15(d)), which are the focus of this process.
All 10-X SEC complete text document filings are downloaded for each year/quarter. As a precautionary note, the EDGAR site's files are generally reliable but I have found based on past experience that they are not wholly stable. For example, a few years back one of the master index files was corrupted and contained only a portion of the full index. Also, this past year, 110 files were missing from the fourth quarter of 2015. I try to validate the downloads using past versions of the data whenever possible to avoid some of these errors.
The text version of a filing provided on the SEC server is an aggregation of all information provided in the browser-friendly files also listed on EDGAR for that filing. For example, IBM’s 10-K filing on 20120228 lists the core 10-K document in HTML format, ten exhibits, four JPG (graphics) files, and six XBRL files. All of these files are also contained in a single text file with the embedded HTML, XBRL, exhibits, and the ASCII-encoded graphics. In the IBM example, of the 48,253,491 characters contained in the file, only about 7.6% account for the 10‑K text including the exhibits and tables. The HTML coding accounts for about 55% of the file. The XBRL tables have a very high ratio of tags to data and account for about 33% of the text file. The remaining 27% of the file is attributable to the ASCII-encoded graphics. In many cases, ASCII-encoded PDFs, graphics, XLS files, or other encoded binary files can account for more than 90% of the document.
Because most textual analysis studies focus on the textual content of the document, the Stage One Parse creates files where all of the 10-X documents have been parsed to exclude markup tags, ASCII-encoded graphics, and tables. We exclude tables because they are usually not the focus of textual analysis. Certainly, one can imagine research where any of these excluded items would require analysis of the original filings. The compressed files, however, provide a standardized and efficient way of facilitating studies focused on text.
3. Markup Tags
All of the original markup language tags (HTML, XBRL, XML) are deleted from the original document. We insert our own markup tags within a header at the beginning of the compressed document and tags to delineate all exhibits in the document. The structure of the tagging system is as follows:
- The following information appears at the beginning of each file:
… All text contained in the original SEC-Header
Note that the <FileStats> data contain character counts for the size of the raw file, the post-parsing size, and character counts for the items deleted from the document.
- All exhibits preceded by the original tags of “<TYPE>EX-##” are encapsulated in the parsed files as:
… original text
4. Parsing Details
Each raw text file downloaded from EDGAR is parsed using the following sequence.
- Remove ASCII-Encoded segments – All document segment <TYPE> tags of GRAPHIC, ZIP, EXCEL and PDF are deleted from the file. ASCII-encoding is a means of converting binary-type files into standard ASCII characters to facilitate transfer across various hardware platforms. A relatively small graphic can create a substantial ASCII segment. Filings containing multiple graphics can be orders of magnitude larger than those containing only textual information.
- Remove <DIV>, <TR>, <TD>, and <FONT> tags – Although we require some HTML information for subsequent parsing, the files are so large (and processed as a single string) that, for processing efficiency, we initially simply strip out some of the formatting HTML.
- Remove all XML – all XML embedded documents are removed.
- Remove all XBRL – all characters between <XBRL …> … </XBRL> are deleted.
- Remove SEC Header/Footer – All characters from the beginning of the original file through </SEC-HEADER> (or </IMS-HEADER> in some older documents) are deleted from the file. Note, however, that the header information is retained and included in the tagged items discussed in section 3. In addition, the footer “-----END PRIVACY-ENHANCED MESSAGE-----” appearing at the end of each document is deleted.
- Replace &NBSP; and &#160; with a blank space.
- Replace &AMP; and &#38; with “&”.
- Remove all remaining extended character references (ISO-8859-1; see http://www.sec.gov/info/edgar/edgarfm-vol2-v34.pdf).
- Remove tables – all characters appearing between <TABLE> and </TABLE> tags are removed.
  - Note that some filers use table tags to delimit paragraphs of text, so each potential table string is first stripped of all HTML and then the number of numeric versus alphabetic characters is compared. For this parsing, only table-encapsulated strings where numeric chars/(alphabetic+numeric chars) > 10% are removed.
  - In some instances, Item 7 and/or Item 8 of a filing begins with a table of data where the Item 7 or 8 demarcation appears as a line within the table string. Thus, any table string containing “Item 7” or “Item 8” (case insensitive) is not deleted.
- Tag Exhibits – At this point in the parsing process all exhibits are tagged as discussed in section 3.
- Remove Markup Tags – remove all remaining markup tags (i.e., <…>).
- Excess linefeeds are removed.
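The sequence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the distributed files: the function names are hypothetical, the regular expressions are my own guesses at reasonable patterns, the raw file is assumed to wrap each segment in <DOCUMENT> … </DOCUMENT>, the entity strings are assumed to be the standard HTML forms, and “excess linefeeds” is taken to mean runs of blank lines. The exhibit-tagging step is left out because the output tag format is not reproduced here.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
ENCODED_TYPES = ("GRAPHIC", "ZIP", "EXCEL", "PDF")


def remove_encoded_segments(text):
    """Drop document segments whose <TYPE> marks an ASCII-encoded binary.

    Assumes each segment is wrapped in <DOCUMENT> ... </DOCUMENT>.
    """
    types = "|".join(ENCODED_TYPES)
    pattern = r"<DOCUMENT>\s*<TYPE>(?:%s)\b.*?</DOCUMENT>" % types
    return re.sub(pattern, "", text, flags=FLAGS)


def replace_entities(text):
    """Map space and ampersand entities, then drop other character references."""
    text = re.sub(r"&nbsp;|&#160;", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"&amp;|&#38;", "&", text, flags=re.IGNORECASE)
    return re.sub(r"&#?\w+;", "", text)


def is_numeric_table(table_html, threshold=0.10):
    """True if the table string should be removed under the 10% rule."""
    lowered = table_html.lower()
    if "item 7" in lowered or "item 8" in lowered:
        return False  # may carry an Item 7/8 demarcation, so keep it
    stripped = re.sub(r"<[^>]*>", "", table_html)
    digits = sum(ch.isdigit() for ch in stripped)
    letters = sum(ch.isalpha() for ch in stripped)
    total = digits + letters
    return total > 0 and digits / total > threshold


def remove_tables(text):
    """Delete <TABLE> ... </TABLE> strings that look like numeric tables."""
    def keep_or_drop(match):
        return "" if is_numeric_table(match.group(0)) else match.group(0)
    return re.sub(r"<TABLE\b.*?</TABLE>", keep_or_drop, text, flags=FLAGS)


def stage_one_parse(raw):
    """Apply the steps in the order listed above (exhibit tagging omitted)."""
    text = remove_encoded_segments(raw)
    text = re.sub(r"</?(?:DIV|TR|TD|FONT)\b[^>]*>", "", text, flags=FLAGS)
    text = re.sub(r"<XML>.*?</XML>", "", text, flags=FLAGS)
    text = re.sub(r"<XBRL.*?</XBRL>", "", text, flags=FLAGS)
    # Header: everything from the start of the file through the closing tag.
    text = re.sub(r"\A.*?</(?:SEC|IMS)-HEADER>", "", text, count=1, flags=FLAGS)
    text = text.replace("-----END PRIVACY-ENHANCED MESSAGE-----", "")
    text = replace_entities(text)
    text = remove_tables(text)
    text = re.sub(r"<[^>]*>", "", text)  # all remaining markup tags
    return re.sub(r"\n\s*\n+", "\n\n", text).strip()
```

Calling stage_one_parse on the full text of one raw filing returns the cleaned string. The table test is kept as a separate function so the 10% threshold can be varied without touching the rest of the sequence.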
 Specifically the following forms are included:
f_10K = ['10-K', '10-K405', '10KSB', '10-KSB', '10KSB40']
f_10KA = ['10-K/A', '10-K405/A', '10KSB/A', '10-KSB/A', '10KSB40/A']
f_10KT = ['10-KT', '10KT405', '10-KT/A', '10KT405/A']
f_10Q = ['10-Q', '10QSB', '10-QSB']
f_10QA = ['10-Q/A', '10QSB/A', '10-QSB/A']
f_10QT = ['10-QT', '10-QT/A']
Note that the 10-KSB and 10-QSB forms are actually mislabeled filings, such as NBG Radio (CIK=1059366, FilingDate=20020228). These are relatively rare. We do not include Form 20-F which is required for foreign companies with less than 50% of their shares traded on a US exchange.
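For readers who want to map an EDGAR form type onto one of the groups listed above, the following is a minimal sketch. The lists are copied from the text; the group labels, the dictionary names, and the form_group function are hypothetical conveniences, and the upper-casing of the input is an assumption about how form types might arrive.

```python
f_10K = ['10-K', '10-K405', '10KSB', '10-KSB', '10KSB40']
f_10KA = ['10-K/A', '10-K405/A', '10KSB/A', '10-KSB/A', '10KSB40/A']
f_10KT = ['10-KT', '10KT405', '10-KT/A', '10KT405/A']
f_10Q = ['10-Q', '10QSB', '10-QSB']
f_10QA = ['10-Q/A', '10QSB/A', '10-QSB/A']
f_10QT = ['10-QT', '10-QT/A']

# Hypothetical group labels, chosen only for this illustration.
FORM_GROUPS = {
    '10-K': f_10K,
    '10-K/A': f_10KA,
    '10-KT': f_10KT,
    '10-Q': f_10Q,
    '10-Q/A': f_10QA,
    '10-QT': f_10QT,
}

# Invert the groups into a form-type -> group lookup.
FORM_TO_GROUP = {
    form: group for group, forms in FORM_GROUPS.items() for form in forms
}


def form_group(form_type):
    """Return the group label for a form type, or None if it is not a 10-X form."""
    return FORM_TO_GROUP.get(form_type.strip().upper())
```

A form type outside the lists, such as 20-F, returns None, which matches the exclusion noted above.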
 XBRL (eXtensible Business Reporting Language) is a markup language. A variant of XML and related to HTML, it provides semantic context for data reported within a 10-K. For example, one line in Google’s 20111231 10-K filing contains “<us-gaap:StockholdersEquity contextRef="eol_PE633170--1110-K0018_STD_0_20081231_0" unitRef="iso4217_USD" decimals="-6">28239000000</us-gaap:StockholdersEquity>”. The “eol …” segment defines the XBRL implementation, the data are in US dollars and the “-6” indicates the number is rounded to millions. See http://xbrl.sec.gov. A few firms began including XBRL in their filings in 2005 with the number expanding substantially in 2010.
 ASCII-encoding converts binary data files to plain ASCII-printable characters, thus ensuring cross platform conformity. The conversion from binary to plain text increases the size of the original file by orders of magnitude.