Text Analysis

Digital text analysis involves the destruction of a body of texts. In contrast to a close-reading approach, where every sentence is read and considered, digital text analysis involves the pulling apart, plucking out, breaking down and eventual clumping back together of words, phrases and sentences in a textual corpus. In comparison to the patient care of the close reader, distant reading can often be a violent rearranging and recombination of texts at the will of the interpreter. We constructed our corpus, or body of text, in order to digitally take it apart. In this building and breaking process a “whole economy of power is invested” (Foucault, 34). In breaking down the text and building it back up, we make many conscious choices: what to exclude, what to include, and what and how to present our findings to the reader as a complete object, one which lacks, for the sake of comprehensibility, many of the traces of the violence enacted on the corpus itself. Our power is largely invisible. Yet we also allow the reader to gain insight into a daunting number of texts. Only a small number of scholars could read 1429 documents in German, French, English and Latin, set in the Fraktur typeface, but with OCR, topic modeling and Voyant we can do a distant reading of a large corpus and present those readings, along with their translations, to a broad audience. As Moretti wrote about distant reading, “We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this ‘poverty’ that makes it possible to handle them, and therefore to know. This is why less is actually more” (58). Reading from a distance with digital tools enables us to see concepts and themes across a large corpus and how they reflect everything from power relationships to vocabulary size to deficiencies in Digital Humanities approaches.

OCR (Optical Character Recognition) and Fraktur

Optical character recognition, or OCR, is the electronic/mechanical process by which images of text (typed, handwritten or printed) are converted into machine-encoded text. By converting these images, they are “translated” into searchable text. This conversion, or translation, of image to text offers the digital humanist various options for analysis and visualization, including a close reading of the text. Read alongside Discipline and Punish and Lawrence Venuti’s concept of translation, the process of conversion enacts “violence” upon the text as image, mangling the words, changing their spelling and rearranging the order of the text, as if the text itself had been drawn and quartered.
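As a brief illustration of what this conversion step looks like in code, here is a minimal sketch using the open-source Tesseract engine via the pytesseract library. This is not the workflow we used, which is described below; the file name and the Fraktur language code are assumptions and vary by installation.

# Minimal OCR sketch with Tesseract/pytesseract (illustrative only; we used
# ABBYY FineReader Online). File name and language code are assumptions.
from PIL import Image
import pytesseract

image = Image.open("pamphlet_page.png")                  # scanned pamphlet page
text = pytesseract.image_to_string(image, lang="frk")    # "frk" = Fraktur traineddata, if installed
print(text)                                              # machine-encoded, searchable text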

Mangled Text OCR

Yet using OCR software also allows the digital humanist to work with a larger body of images and texts more quickly than transcribing each image by hand. Although the process is faster than transcription, the conversion/translation is not flawless, and it is crucial to edit and correct the text document against the image for accuracy. For the OCR portion of our text analysis project we used a trial version of ABBYY FineReader Online. Working with ABBYY FineReader was fairly simple and can be broken down into three steps: 1) upload the image to ABBYY and choose the output file format (Word document, txt file, PDF, etc.); 2) hit “recognize,” which converts the image into a text document; and 3) download the new file. We chose ABBYY FineReader because it is one of the few OCR systems compatible with Fraktur.

Even so, our team had to make many corrections, and one problem that really struck us was the lack of recognition for numbers: every single number came out as a letter (2, for example, came out as a “z”). Some words in the converted document became unintelligible because of umlauts or particular letters (E and S were frequently converted to the symbol “»”). At the same time, OCR increases legibility: fonts no longer in common use can be converted, as can documents made hard to read by age or faded characters. German students who cannot read Fraktur can now read and search the documents, and OCR allows us to interact with the text more deeply, especially digitally, than if we were working with the images alone.

It took roughly 3-4 hours for the entire OCR process: uploading and recognizing the files, then editing the text files. Correcting and editing the files was by far the most time-consuming aspect of OCR; it took our team roughly 2.5-3 hours to edit one document, or roughly 30-45 minutes per page. The process speeds up, though, once the editor begins to see common patterns in the conversion errors: incorrect letters, symbols standing in for certain letters, and numbers converted into letters. Certain words are also consistently incorrect, like “sich,” which converted to “stch” on a regular basis; these can be fixed fairly quickly and efficiently with the find-and-replace tool in Notepad or Microsoft Word (a scripted version of this step is sketched below). Since this was our first time editing an OCR document converted from Fraktur, we spent extra time double-checking every word, punctuation mark and line, and only gained speed after working with OCR on some additional documents.

Although there are many possibilities offered by OCR, there are also various limitations. ABBYY FineReader itself is only somewhat accurate, and many errors need to be corrected after conversion. Working with various languages or scripts can also be a limitation, because there may be only one or two software options for OCR; Fraktur was somewhat difficult, and ABBYY FineReader was the only software available to us that handled it. Another limitation lies in the images to be converted: any rips, tears, smudges or dirt can drastically change the way the image is converted to text.
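The recurring errors described above lend themselves to simple scripting. The following is a minimal sketch of the kind of batch find-and-replace we otherwise did by hand in Notepad and Word; the folder name and every dictionary entry apart from “stch” → “sich” are illustrative assumptions, and symbol substitutions (such as “»” standing in for E or S) still have to be checked against the page image rather than replaced blindly.

# Apply a small dictionary of recurring OCR errors to every converted text file.
# Paths and most entries are assumptions; only "stch" -> "sich" comes from our notes.
import glob

corrections = {
    "stch": "sich",      # recurring Fraktur misreading noted above
    # "»": "S",          # symbol substitutions vary by context; check these by eye instead
}

for path in glob.glob("ocr_output/*.txt"):               # hypothetical folder of OCR output
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)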

Voyant

Voyant as a tool of textual analysis offers distinct advantages and disadvantages to scholars, users and readers of texts. One advantage is that it quickly provides summaries of the formal elements of a group of texts. In our group of texts, the titles of 100 Poor Sinner’s Pamphlets out of a total of 1429 documents, there are 3,715 total words and 1,271 unique word forms. That gives our corpus a ratio of total words to unique word forms of 2.92, meaning each unique word form appears fewer than three times on average; compared with other texts, this is a relatively high vocabulary density, that is, a relatively large variety of words (Booth). The reason may be that names and locations are often mentioned in the titles of our documents, which enlarges the vocabulary. A look at the top ten collocates in Voyant, however, shows a more limited lexicon among the most frequent terms, with only eight distinct words listed in the first ten rows of collocates.
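The density figure itself is simple arithmetic, as the short sketch below shows; the file name is a placeholder for an export of the titles, and a naive whitespace tokenization will not reproduce Voyant’s counts exactly.

# Word density of the titles corpus: total words per unique word form.
# With Voyant's counts of 3,715 total words and 1,271 unique forms,
# 3715 / 1271 is roughly 2.92.
with open("titles.txt", encoding="utf-8") as f:   # exported titles (placeholder name)
    words = f.read().lower().split()              # naive tokenization

total = len(words)
unique = len(set(words))
print(total, unique, round(total / unique, 2))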

 Voyant Collocates

These words, Leben (life), Tod (death), gebracht (brought), geschichtliche (historical), Darstellung (presentation), worden (were), Schwerdt (sword), wegen (because of), also closely reflect the nature of the 100 documents in our collection. The words refer either to death and execution, to the legal process (a word like wegen hints at the reasons for a death sentence), or to the description of past events, with worden used like the English passive “were.” The passive is used in reference to someone being executed, as in “Conrad Müller was broken on the wheel,” and never mentions the executioner, the state power which tries and executes the criminals in the pamphlets. This obfuscates the way in which power comes to be enacted, while still making that power present. We are not, however, offered any insight into the context of the crimes, the nature of the crimes or the nature of the perpetrators through the collocates. Collocates provide a broad summary of the content, but no fine articulation of what is in it. Foucault wrote that “[t]he archive is the first law of what can be said, the system that governs the appearance of statements as unique events” (30). If we take the group of titles put into Voyant as our “archive,” then what can be articulated through that archive via collocation, word count and word variation is limited.

Voyant also allows visualizations of corpora. Johanna Drucker, in her monograph Graphesis, argues that knowledge can be uniquely and productively presented in visual forms, and that such visual forms can “serve intellectual work, particularly with respects to interpretation” (20). In this respect, Voyant’s visualizations reflected the same limitations as its non-visualized analyses. First, only two visualizations were legible: the word cloud Cirrus and Links.

 Link Vis

Cirrus Visualization

Other visualizations, such as Bubbleline and Trends, are chronological and, since our titles corpus does not contain dates, were not useful. Others, such as Scatterplot, were so cluttered as to be illegible.

Trends Visualization

The two visualizations we used, Cirrus and Links, worked well to provide a summary of the over 1400 texts, but did not provide a great deal of insight beyond themes and writing style. The results were similar to the collocates, with many of the same words and relationships visualized. One positive outcome is that these summaries, textual and visual, allow non-German speakers to understand the broad nature of the texts, i.e. tone and key themes, with minimal translation. That extends the audience that can gain some understanding of a large corpus.

Topic Modeling

TopicModelingTool (TMT) is a tool used by digital humanists interested in distant reading; it uses algorithms to discover broad thematic elements in large collections of texts. For our topic modeling project, we focused on the titles of all 1429 documents in the criminology collection.

Before beginning the process of topic modeling, our data had to be properly formatted. In a Google spreadsheet, we created two columns: the first held a sequence of numbers from 1 to 1429; the second contained all of the titles in the subset. Using the find-and-replace function, we removed all of the punctuation from the document (periods, commas, brackets, etc.) to prevent TMT from differentiating the same word based on surrounding punctuation (e.g., “commended”, “commended,” and “commended.” would otherwise be treated as different word forms). Then we downloaded the spreadsheet as a .csv file and used a Python script to create separate text files containing each individual title (the general shape of such a script is sketched below). Finally, we ran the text files as a group through TMT, along with a stop-word list we had created of common insignificant words in German, French, English and Latin that we wanted removed from our topic outputs.
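Our script is shown in the screenshot below; in outline it does something like the following sketch, in which the file names, the column order and the punctuation handling are assumptions.

# Read the exported .csv (number, title) and write one text file per title
# for TopicModelingTool. File names and column order are assumptions.
import csv
import os
import string

os.makedirs("titles", exist_ok=True)

with open("titles.csv", encoding="utf-8", newline="") as f:
    for number, title in csv.reader(f):
        # strip any remaining punctuation so TMT sees one word form per word
        cleaned = title.translate(str.maketrans("", "", string.punctuation))
        with open(os.path.join("titles", number + ".txt"), "w", encoding="utf-8") as out:
            out.write(cleaned)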

Topic Modeling with Python

When we looked at the outputs from this initial attempt, we discovered some serious problems. Single letters would appear in the topics, along with unintelligible words that seemed to have no meaning in any language contained in the archive. After referring back to the titles, we found that people were often referenced by their initials, “Matthias N.,” for example. To prevent these initials from affecting our analysis, we added the letters a through z to our list of stop words. To investigate the unintelligible words, we looked at the sections of the texts informing those topics; from this, we found that TMT did not recognize accents or eszetts correctly. To account for this limitation, we replaced every accented letter in our spreadsheet with its corresponding unaccented digraph (e.g., ü = ue, ö = oe, ä = ae, ß = ss) and repeated the exporting process. Topic modeling reveals in microcosm the problems of non-English languages in the digital humanities writ large: at the text-based phases of our overall project, the special characters permeating our archive added extra characters (during data processing), caused data to be truncated (when using CSVImport with Omeka), and, here, rendered results incomprehensible.
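The character replacement is mechanical enough to script as well. A minimal sketch of that normalization step follows; the function name is ours, and only the German characters we actually encountered are listed.

# Replace accented characters with the unaccented digraphs TMT can handle.
REPLACEMENTS = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss",
                "Ü": "Ue", "Ö": "Oe", "Ä": "Ae"}

def normalize(title):
    for accented, digraph in REPLACEMENTS.items():
        title = title.replace(accented, digraph)
    return title

print(normalize("Mörder und Räuber"))   # -> "Moerder und Raeuber"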

After running the text files through TMT a second time, the vast majority of the words in our output topics looked clean. Due to time constraints and our limited experience with topic modeling, we chose to keep the default TMT settings: 10 topics of 10 words each, at 200 iterations. The results of our final TMT run are the topics in the image below.
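For readers who prefer a scriptable route, a rough equivalent of a run with these settings can be sketched in Python with the gensim library. This is an assumption on our part rather than a record of our process: we used TMT’s graphical interface, and gensim’s LDA implementation will not reproduce TMT’s output exactly.

# Illustrative LDA run comparable to our TMT settings: 10 topics, 10 words
# per topic, 200 iterations. Paths and the stop-word file name are assumptions.
import glob

from gensim import corpora, models

texts = []
for path in sorted(glob.glob("titles/*.txt")):        # per-title files from earlier
    with open(path, encoding="utf-8") as f:
        texts.append(f.read().lower().split())

with open("stopwords.txt", encoding="utf-8") as f:    # our multilingual stop-word list
    stopwords = set(f.read().split())
texts = [[w for w in doc if w not in stopwords] for doc in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = models.LdaModel(corpus, id2word=dictionary,
                      num_topics=10, iterations=200, passes=5)

for topic_id, words in lda.show_topics(num_topics=10, num_words=10):
    print(topic_id, words)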

Obviously there are still issues with this list. There is some gibberish (e.g., “dr,” “dn,” “ii”) and a few insignificant words slipped through (e.g., “is,” “whose,” “and”), issues which would have required further honing of our stop-word list to eliminate. Despite these, we can still get an interesting look at some themes. In topic 2, we see a collection of only Latin words, all relating to the judicial side of punishment, referencing the law, justice and debate; these reflect the large number of legal dissertations in the criminology collection. In topic 4 we see a collection of terms concerned with the power of the state (the “holy” referring here to the Holy Roman Empire) and the word “right,” which refers both to law (as in German Recht) and to the self-proclaimed correctness (“righteousness”) of the sentences. In topic 5 we find the more quotidian facts surrounding capital punishment: death, crime and a justification for punishment (“because of”), as well as the execution date.

Conclusion

This project explores a few tools and methods for digital text analysis, while seeking to make transparent the different advantages, disadvantages and choices they entail. While advantages exist, such as the malleability afforded by OCR or the themes brought to light by topic modeling, there are also disadvantages. A close view of the text disappears, and we as scholars do violence in pulling apart our corpora for the purpose of bringing forth legible results in our analysis. Finding the balance between increasing distant readability and respecting sensitive texts such as ours, centered on life and death, remains an ethical and analytical issue.

Works Cited

Blei, David M. "Topic Modeling and Digital Humanities." Journal of Digital Humanities. Web. <http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/>. 

Drucker, Johanna. Graphesis: Visual Forms of Knowledge Production. 2014. Print.

Foucault, Michel. Discipline and Punish: The Birth of the Prison. New York: Vintage Books, 1995. Print.

Moretti, Franco. "Conjectures on World Literature." New Left Review 1.1 (2000): 54. Web.

Simpson, Zachary Booth. "Vocabulary Analysis of Project Gutenberg." Project Guttenberg Vocabulary Analysis. Web. 03 May 2016. <http://www.mine-control.com/zack/guttenberg/>.

"William Whitaker's Words." William Whitaker's Words. Web. 27 Apr. 2016. <http://archives.nd.edu/words.html>.