
Visualizing English Print at the Folger, by Gregory Kneidel (cross-post with Ruff Draughts)

In December I spent two days at the Folger’s Visualizing English Print seminar. It brought together people from the Folger, the University of Wisconsin, and the University of Strathclyde in Glasgow; about half of us were literature people, half computer science; a third of us were tenure-track faculty, a third grad students, and a third in other types of research positions (e.g., librarians and DH directors).

Over those two days, we worked our way through a set of custom data visualization tools that can be found here. Before we could visualize, we needed and were given data: a huge corpus of nearly 33,000 EEBO-TCP-derived simple text files that had been cleaned up and run through a regularizing procedure so that they would be machine-readable (with loss, obviously, of lots of cool, irregular features; the grad students who wanted to do big data studies of prosody were bummed to learn that all contractions and elisions had been scrubbed out). They also gave us a few smaller, curated corpora of texts: two of dramatic texts and two of scientific texts. Anyone who wants a copy of this data, I’d be happy to hook you up.

From there, we did (or were shown) a lot of data visualization. Some of this was based on word-frequency counts, but the really novel thing was using a dictionary of sorts called DocuScope: basically a program that sorts each of 40 million different linguistic patterns into one of about 100 specific rhetorical/verbal categories. (DocuScope was developed at CMU as a rhet/comp tool; it turned out not to be good at teaching rhet/comp, but it is good at things like picking stocks.) DocuScope might make a hash of some words or phrases (and you can revise or modify it; Michael Witmore tailored a DocuScope dictionary to early modern English), but it does so consistently and you’re counting on the law of averages to wash everything out.
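
For readers who want to see the mechanics, here is a minimal sketch of dictionary-based tagging of the kind described above. It is not DocuScope itself: the tiny dictionary, the category names, the greedy longest-match rule, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy dictionary: token patterns -> category labels.
# These entries and labels are illustrative, not DocuScope's real inventory.
TOY_DICTIONARY = {
    ("i",): "FirstPerson",
    ("my",): "FirstPerson",
    ("i", "am", "furious"): "Anger",
    ("rage",): "Anger",
}
MAX_PATTERN_LEN = max(len(pattern) for pattern in TOY_DICTIONARY)


def tag_tokens(tokens):
    """Greedy longest-match tagging; returns (tokens per category, total tokens)."""
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest pattern first so phrases win over single words.
        for n in range(min(MAX_PATTERN_LEN, len(tokens) - i), 0, -1):
            category = TOY_DICTIONARY.get(tuple(tokens[i:i + n]))
            if category is not None:
                counts[category] += n  # credit every token the pattern covers
                i += n
                break
        else:
            i += 1  # no pattern starts here; leave the token untagged
    return counts, len(tokens)


def category_proportions(text):
    """Share of a text's tokens that fall under each category."""
    tokens = text.lower().split()
    counts, total = tag_tokens(tokens)
    return {cat: n / total for cat, n in counts.items()} if total else {}


proportions = category_proportions("I am furious and my rage grows")
```

In this toy run, four of the seven tokens land in "Anger" and one in "FirstPerson". The point about consistency holds even at this scale: the same pattern always gets the same tag, so any mislabeling is applied evenly across every text you compare.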

After drinking the DocuScope Kool-Aid, we learned how to visualize the results of DocuScoped data analysis. Again, there were a few other cool features and possibilities, and I only comprehended the tip of the data-analysis iceberg, but basically this involved one of two things.

  • Using something called the MetaData Builder, we derived DocuScope data for individual texts or groups of texts within a large corpus of texts. So, for example, we could find out which of the approximately 500 plays in our subcorpus of dramatic texts is the angriest (i.e., has the greatest proportion of words/phrases DocuScope tags as relating to anger). Or, in an example we discussed at length, we could ask who used more first-person references within the texts in our science subcorpus, Boyle or Hobbes (i.e., which had the greater proportion of words/phrases DocuScope tags as first-person references). The CS people were quite skilled at slicing, dicing, and graphing all this data in cool combinations. Here are some examples. A more polished essay using this kind of data analysis is here. So this first approach shows the distribution of DocuScope traits across texts in large and small corpora.
  • We visualized the distribution of DocuScope tags within a single text using something called VEP Slim TV. Using Slim TV, you can track the rise and fall of each trait within a given text AND (and this is the key part) link directly to the text itself. So, for example, this is an image of Margaret Cavendish’s Blazing-World (1667).

[Image: vep_2]

Here, the blue line in the right frame charts lexical patterns that DocuScope tags as “Sense Objects.”

[Images: vep_3, vep_4]

The red line charts lexical patterns that DocuScope tags as “Positive Standards.” You’ll see there is lots of blue (compared to red) at the beginning of Cavendish’s novel (when the Lady is interviewing various Bird-Men and Bear-Men about their scientific experiments), but one stretch in the novel where there is more red than blue (when the Lady is conversing with Immaterial Spirits about the traits of nobility). A really cool thing about Slim TV that could make it useful in the classroom: you can move through and link directly to the text itself (that horizontal yellow bar on the right shows which section of the text is currently being displayed).
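
The rise and fall that Slim TV charts can be approximated with a sliding window over the text. The sketch below is not Slim TV's code; the word sets standing in for the two traits, the window and step sizes, and the function name are all hypothetical, chosen only to show the idea.

```python
def trait_track(tokens, trait_words, window=4, step=2):
    """Return (start_index, share of window tokens that belong to the trait)."""
    if not tokens:
        return []
    track = []
    # Last window start that still fits a full window (0 for very short texts).
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = tokens[start:start + window]
        hits = sum(1 for tok in chunk if tok in trait_words)
        track.append((start, hits / len(chunk)))
    return track


# Hypothetical stand-ins for two traits; real categories hold far more patterns.
SENSE_OBJECTS = {"stone", "water", "fire"}
POSITIVE_STANDARDS = {"noble", "virtue", "honour"}

tokens = "stone water fire burns noble virtue honour shines".split()
blue = trait_track(tokens, SENSE_OBJECTS)       # [(0, 0.75), (2, 0.25), (4, 0.0)]
red = trait_track(tokens, POSITIVE_STANDARDS)   # [(0, 0.0), (2, 0.5), (4, 0.75)]
```

Because each point keeps its starting token index, a viewer can jump from a peak on the chart straight to the passage that produced it, which is the linking feature described above.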

So: 1) regularized EEBO-TCP texts are turned into spreadsheets using 2) the DocuScope dictionary; that data is then used to visualize either 3) individual texts as data points within a larger corpus of texts or 4) the distribution of DocuScope tags within a single text.
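
Steps 1 through 3 can be sketched as a short script that turns a handful of texts into a spreadsheet-style table and then ranks them by one category. This is a sketch under stated assumptions, not the MetaData Builder: the two-text corpus, the single-word dictionary, and every function name here are invented for illustration.

```python
import csv
import io

# Hypothetical single-word dictionary standing in for the real one.
TOY_TAGS = {"rage": "Anger", "fury": "Anger", "i": "FirstPerson", "we": "FirstPerson"}
CATEGORIES = sorted(set(TOY_TAGS.values()))


def profile(text):
    """Proportion of tokens in each category for one text."""
    tokens = text.lower().split()
    counts = {cat: 0 for cat in CATEGORIES}
    for tok in tokens:
        cat = TOY_TAGS.get(tok)
        if cat is not None:
            counts[cat] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty texts
    return {cat: n / total for cat, n in counts.items()}


def build_table(corpus):
    """corpus maps title -> text; returns one row (dict) per text."""
    return [{"title": title, **profile(text)} for title, text in corpus.items()]


def to_csv(rows):
    """Serialize the rows as CSV text, one column per category."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title"] + CATEGORIES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rank_by(rows, category):
    """Rows sorted from highest to lowest proportion of the given category."""
    return sorted(rows, key=lambda row: row[category], reverse=True)


# Made-up placeholder texts, not drawn from the seminar corpora.
corpus = {
    "Play A": "rage and fury fill the stage",
    "Play B": "we walk and i sing of rage",
}
rows = build_table(corpus)
angriest = rank_by(rows, "Anger")[0]["title"]  # "Play A"
table_text = to_csv(rows)
```

The CSV text is the kind of file graphing software can read directly, with each text as one row (one data point) and each category as one column.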

Again, the seminar leaders showed some nice examples of where this kind of research can lead and lots of cool-looking graphs. Ultimately, some of the findings were, if not underwhelming, at least just whelming: we had fun discussing the finding that, relatively speaking, Shakespeare’s comedies tend to use “a” and his tragedies tend to use “the.” Do we want to live in a world where that is interesting? As we experimented with the tools they gave us, at times it felt a little like playing with a Magic 8 Ball: no matter what texts you fed it, DocuScope would give you lots of possible answers, but you just couldn’t tell if the original question was important or figure out if the answers had anything to do with the question. So formulating good research questions remains, to no one’s surprise, the real trick.

A few other key takeaways for me:

1) Learn to love CSV files or, better, learn to love someone from the CS world who digs graphing software;

2) Curated data corpora might be the new graduate/honors thesis. Create a corpus (e.g., sermons, epics, travel narratives, court reports, romances), add some good metadata, and you’ve got yourself a lasting contribution to knowledge (again, the examples here are the drama corpora or the science corpora). A few weeks ago, Alan Liu told me that he requires his dissertation advisees to have at least one chapter that gets off the printed page and has some kind of digital component. A curated data collection, which could be spun through DocuScope or any other kind of textual analysis program, could be just that kind of thing.

3) For classroom use, the coolest thing was VEP Slim TV, which tracks the prominence of certain verbal/rhetorical features within a specific text and links directly to the text under consideration. It’s colorful and customizable, something students might find enjoyable.

All this stuff is publicly available as well. I’d be happy to demo what we did (or what I can do of what we did) to anyone who is interested.

Gregory Kneidel is Associate Professor of English at the Hartford Campus. He specializes in Renaissance poetry and prose, law and literature, and textual editing. He can be reached at gregory.kneidel@uconn.edu.