For Beginners

Most of the feedback we receive on the textual analysis papers we’ve published comes from individuals who are just beginning to explore the technology. On this page we walk you through one approach to navigating the multitude of programming languages and development environments that can be used to analyze documents.
 

Which Programming Language Should I Use?

Most modern languages (or software packages) provide tools for textual analysis. When we started working in this area a decade ago, we found Microsoft’s Visual Studio to be the most useful option because it offered languages with a robust regular expression engine and a productive development environment. As languages have evolved over the past decade, we have migrated all of our development efforts to Python. Python is now a much more mature language, is open source, and has a rich suite of pre-coded routines to facilitate development. We acknowledge that programming languages and operating systems are a matter of religion. R, SAS, Stata, MATLAB, and all of the usual suspects have their own advantages and disadvantages. We are simply providing one of many paths.

Python

One of the disadvantages of an open source platform like Python is the many variants that are available, which can be overwhelming for someone just learning the technology. We recommend using Python 3.x (versus 2.7). If you try to build your own Python environment, you will struggle to resolve all of the versioning issues. We highly recommend downloading the Anaconda platform (Python 3.x, 64-bit graphical installer for Windows or Mac OS X), which installs Python 3.x and the most widely used supplementary modules. Anaconda does all of the work of making sure that the modules are compatible with the Python version embedded in its platform.
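Once an environment is installed, a quick check of what it contains can save some confusion. The following is a minimal sketch, assuming the import names listed in PACKAGES (our choice for illustration, covering the packages discussed on this page); it reports the running Python version and whether each module can be located, without importing any of them.

```python
import importlib.util
import sys

# Packages to look for; this list is an assumption chosen for illustration.
PACKAGES = ["numpy", "scipy", "matplotlib", "pandas", "nltk"]

# find_spec returns None when a top-level module cannot be located.
installed = {name: importlib.util.find_spec(name) is not None for name in PACKAGES}

print("Python", sys.version.split()[0])
for name, found in installed.items():
    status = "found" if found else "missing"
    print(f"{name:12s} {status}")
```

If a module is reported as missing in an Anaconda environment, it is worth confirming that the interpreter being run is the one Anaconda installed.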

Much of the power of Python comes from pre-coded modules and packages. For example, NumPy and SciPy provide scientific tools, Matplotlib facilitates plotting, pandas is used for data structures and analysis, and NLTK provides a natural language toolkit. The advantage of these packages is that complex problems can often be solved in a very small number of statements. But be careful. In textual analysis, the prepackaged tools frequently do not perform well on complex financial documents. Textual analysis is not like inverting a matrix, where a routine developed in another context will still provide exactly the same result. For example, using NLTK to produce an average-words-per-sentence measure for 10-Ks is very simple but produces disastrous results.
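To give a feel for the kind of failure involved, here is a small sketch. It does not reproduce NLTK's tokenizer; it uses a deliberately naive stand-in rule (a sentence ends at every period, question mark, or exclamation point), and the sample text is a hypothetical sentence written in the style of a 10-K rather than taken from a real filing.

```python
import re

# Hypothetical text in the style of a 10-K; it is a single sentence.
text = ("The Company's U.S. segment reported net sales of $1.2 billion "
        "in fiscal 2015, an increase of 3.5% over the prior year.")

# Naive rule: every '.', '!' or '?' ends a sentence.
naive_sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]

words = text.split()
naive_avg = len(words) / len(naive_sentences)
true_avg = len(words) / 1  # the text is one sentence

print("sentences found by naive rule:", len(naive_sentences))
print("naive average words per sentence:", round(naive_avg, 1))
print("actual average words per sentence:", true_avg)
```

The abbreviation and the decimal points are each treated as sentence boundaries, so the naive rule sees several short "sentences" where there is only one. Real filings add tables, headings, and lists, which is why any off-the-shelf measure should be checked against a sample of your own documents.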

IDE

Writing code is much easier in an integrated development environment (IDE). (I appreciate that if you want to count yourself as a hardcore coder you probably would just use a very simple text editor and run everything in batch...on an Altair 8800.) The Python community offers a variety of IDEs, all with their own advantages and disadvantages. We recommend PyCharm (Free Community version).

Regular Expressions

One of the key tools in textual analysis is the use of regular expressions (regex) to efficiently identify patterns in text.  For example, the expression:

(?<=\.) {2,}(?=[A-Z])

matches two or more spaces that are preceded by a period and followed by an upper case letter. Using tools provided in Python, you can identify every occurrence of this sequence in a string variable (which might be an entire document). Regex support is built into most modern programming languages, although there are some minor syntax differences across the parent languages. Beyond Python’s regex documentation, other sites provide useful introductory tutorials (for example, http://www.regular-expressions.info). If you plan on parsing text documents you will need to learn regex. Getting a complex regex to work correctly for your specific document usually requires a fair amount of tuning. (This includes those snippets you take from "expert" sources online. A regex that works effectively on a novel might fail miserably on a 10-K filing.)
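As an illustration, the expression above can be applied with Python's built-in re module. This is a minimal sketch: the sample string and the variable names are hypothetical, and the choice to collapse each match to a single space is just one thing you might do with the matches.

```python
import re

# Pattern from the text: two or more spaces that follow a period
# and are followed by an upper case letter.
DOUBLE_SPACE = re.compile(r"(?<=\.) {2,}(?=[A-Z])")

# Hypothetical sample text, not taken from a real filing.
text = "Revenue rose.  Costs fell.   Margins improved. Cash was flat."

# Record the start position and length of every occurrence.
matches = [(m.start(), len(m.group())) for m in DOUBLE_SPACE.finditer(text)]

# Collapse each matched run of spaces to a single space.
normalized = DOUBLE_SPACE.sub(" ", text)

print(matches)
print(normalized)
```

Only the runs of two or more spaces are matched; the single space before "Cash" is left alone. The lookbehind and lookahead mean the period and the capital letter are checked but not consumed, so the substitution changes nothing but the spaces.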

As with most modern computer languages, the best way to learn Python is to start coding and use online resources to work through problems. If you've programmed before in another language, the transition is fairly easy. (Quick summary: spacing is important, lists and dictionaries are extremely powerful.) The central documentation for Python 3.x is the official site at docs.python.org, and most questions can be resolved using Google to find answers provided on stackoverflow.com. Of course there are many books available to help learn the language, but I haven’t found them to be any more effective than the online resources. Some good online learning resources are the Python sections on Reddit and Codecademy.
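The quick summary above can be seen in a few lines. The sketch below uses a hypothetical sample sentence to show a list holding words in order, a dictionary mapping each word to its count, and indentation defining the body of the loop.

```python
# Hypothetical sample sentence; any string would do.
text = "risk factors may affect results and risk may increase"

# A list holds the words in the order they appear.
words = text.split()

# A dictionary maps each word to the number of times it occurs.
counts = {}
for word in words:
    # The indentation marks this line as the body of the loop.
    counts[word] = counts.get(word, 0) + 1

# Sort by descending count, then alphabetically, and keep the top three.
top_three = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:3]
print(top_three)
```

Removing the indentation from the line inside the loop raises an IndentationError, which is the sense in which spacing is important.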

 

Punchline

Download Anaconda, then PyCharm, and start experimenting. We will attempt to provide some introductory examples on the software pages.