
Digital Humanities Is for Humans, Not Just Humanists: Social Science and DH, by Kitty O’Riordan

In an article published online last month by The Guardian—“AI programs exhibit racial and gender biases, research reveals”—the computer scientists behind the technology were careful to emphasize that this reflects not prejudice on the part of artificial intelligence, but AI’s learning of our own prejudices as encoded within language.

“Word embedding,” the article explains, is “already used in web search and machine translation”; it “works by building up a mathematical representation of language, in which the meaning of a word is distilled into a series of numbers (known as a word vector) based on which other words most frequently appear alongside it. Perhaps surprisingly, this purely statistical approach appears to capture the rich cultural and social context of what a word means in the way that a dictionary definition would be incapable of.”
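The mechanism is easy to sketch. The toy example below (the corpus, window size, and word choices are invented for illustration; real systems like word2vec or GloVe are far more sophisticated) builds word vectors from simple co-occurrence counts and compares them with cosine similarity:

```python
from collections import Counter
from math import sqrt

# Toy corpus: word vectors are built from co-occurrence counts within a window.
corpus = ("the doctor treated the patient . the nurse treated the patient . "
          "the poet wrote the poem . the poet read the poem .").split()

def vector(word, window=2):
    """Count how often each context word appears within `window` tokens of `word`."""
    counts = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    counts[corpus[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda v: sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Words that share contexts ("doctor"/"nurse") score higher than unrelated pairs.
print(cosine(vector("doctor"), vector("nurse")))
print(cosine(vector("doctor"), vector("poet")))
```

Biases enter exactly here: if the training text pairs certain words with certain groups, the resulting vectors inherit those pairings.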

This tool’s ability to reproduce complex and nuanced word associations is probably not surprising to anyone familiar with digital humanities. Nor should the specific findings surprise anyone who has been paying attention: the model matched pleasant words with whiteness and unpleasant ones with blackness, and associated “woman” with the arts and interpretative disciplines and “man” with the STEM fields. The distressing prospect that AI and other digital programs and platforms will only reinforce existing bias and inequality has certainly garnered the attention of scholars in media studies and DH, but one could argue that it has received equal attention in the social sciences.

As a graduate student in cultural anthropology drawn to DH, I sometimes find myself considering what exactly demarcates digital humanities from social science when apprehending these kinds of topics; somehow, with the addition of ‘digital’, the lines seem to have blurred. Both ultimately represent an investigation of how humans create meaning through or in relation to the digital universe, and the diverse methodologies at the disposal of each are increasingly overlapping. Below are just a few reasons, from my limited experience, as to why social scientists can benefit from involvement with digital humanities—and vice versa.

1) Tools developed in DH can serve as methodologies in the social sciences.

Text mining, a process that derives patterns and trends from textual sources, much as the word-embedding research described above does, is particularly suited to social-science analysis of primary sources. Programs like Voyant and Textalyser are free and easily available on the web, with no downloads or installations required, and can pull data from PDFs, URLs, Microsoft Word documents, plain text, and more. Interview transcripts can also be analyzed with these programs, and the graphs and word clouds they create provide a unique way to “see” an argument, a theme, or a bias.
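At their core, these tools start from word-frequency counts. A minimal sketch of that step in Python (the stopword list and sample transcript here are invented for illustration; Voyant’s actual processing is richer):

```python
import re
from collections import Counter

# A tiny, invented stopword list; real tools ship much larger ones.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "i", "it", "that", "was"}

def top_terms(text, n=5):
    """Tokenize a transcript, drop stopwords, and return the most frequent terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

transcript = ("I moved to the city for work, and the work was hard. "
              "The city changed how I think about work and community.")
print(top_terms(transcript))  # "work" and "city" surface as dominant themes
```

The word clouds and frequency graphs these platforms draw are visual renderings of exactly this kind of tally.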

Platforms like Omeka and Scalar not only give visual anthropologists an opportunity to display ethnographic information, but can give powerful form to arguments in a way that text alone cannot (see, for example, Performing Archive: Curtis + “the vanishing race”, which turns Edward S. Curtis’ famous photos of Native Americans on their heads by visualizing the categories instead of the categorized).

2) Both fields are tackling the same issues.

Miriam Posner writes that she “would like us to start understanding markers like gender and race not as givens but as constructions…I want us to stop acting as though the data models for identity are containers to be filled in order to produce meaning and recognize instead that these structures themselves constitute data.” Drucker and Svensson echo that creating data structures that expose inequality or incorporate diversity is not as straightforward as it seems, given that “the organization of the fields and tag sets already prescribes what can be included and how these inclusions are put into signifying relations with each other” (10). Anthropologist Sally Engle Merry, in The Seductions of Quantification, expounds on this idea in the realm of human rights, showing that indicators can obscure as much as or more than they reveal. Alliances between DHers, as builders and analyzers of digital tools and platforms, and social scientists, as suppliers of information on how these tools play out on the ground in various cultural contexts, benefit both fields.

3) Emerging fields in the social sciences can learn a lot from established DH communities and scholarship.

Digital anthropology, digital sociology, cyberanthropology, digital ethnography, and virtual anthropology are all sub-disciplines emerging from the social sciences with foci and methods that often overlap with those of digital humanities. Studies of Second Life, World of Warcraft, or hacking; the ways diasporic communities use social media platforms to maintain relationships; or projects that focus on digitizing indigenous languages all have counterparts within digital humanities. Theoretically, there is much to compare: Richard Grusin’s work on mediation intersects with anthropologists leading the “ontological turn” like Philippe Descola and Eduardo Viveiros de Castro; Florian Cramer’s work on the ‘post-digital’ pairs interestingly with Shannon Lee Dawdy’s concept of “clockpunk” anthropology, influenced by thinkers both disciplines share like Walter Benjamin and Bruno Latour.

Though I am still relatively new to DH, one theme I find repeated often, and which represents much of the promise and the excitement of digital humanities for me, is the push for collaboration and the breaking down of disciplinary boundaries. Technologies like AI remind us that we all share the collective responsibility to build digital worlds that don’t simply reflect the restrictions and biases of our textual and social worlds.


Kitty O’Riordan is a doctoral student in cultural anthropology at the University of Connecticut. Her research interests include anthropology of media and public discourse, comparative science studies, and contemporary indigenous issues in New England. You can reach her at caitlin.o’riordan@uconn.edu.

Visualizing English Print at the Folger, by Gregory Kneidel (cross-post with Ruff Draughts)

In December I spent two days at the Folger’s Visualizing English Print seminar. It brought together people from the Folger, the University of Wisconsin, and the University of Strathclyde in Glasgow; about half of us were literature people, half computer science; a third of us were tenure-track faculty, a third grad students, and a third in other types of research positions (e.g., librarians, DH directors).

Over those two days, we worked our way through a set of custom data visualization tools that can be found here. Before we could visualize, we needed and were given data: a huge corpus of nearly 33,000 EEBO-TCP-derived simple text files that had been cleaned up and run through a regularizing procedure so that it would be machine-readable (with loss, obviously, of lots of cool, irregular features—the grad students who wanted to do big-data studies of prosody were bummed to learn that all contractions and elisions had been scrubbed out). They also gave us a few smaller, curated corpora of texts: two of dramatic texts, two of scientific texts. Anyone who wants a copy of this data, I’d be happy to hook you up.

From there, we did (or were shown) a lot of data visualization. Some of this was based on word-frequency counts, but the real novel thing was using a dictionary of sorts called DocuScope—basically a program that sorts 40 million different linguistic patterns into one of about 100 specific rhetorical/verbal categories (DocuScope was developed at CMU as a rhet/comp tool—turned out not to be good at teaching rhet/comp, but it is good at things like picking stocks). DocuScope might make a hash of some words or phrases (and you can revise or modify it; Michael Witmore tailored a DocuScope dictionary to early modern English), but it does so consistently and you’re counting on the law of averages to wash everything out.
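To make the idea concrete, here is a drastically simplified sketch of dictionary-based tagging. The categories and word lists below are invented stand-ins: the real DocuScope dictionary maps tens of millions of multi-word patterns, not a handful of single words.

```python
# Invented mini-dictionary standing in for DocuScope's ~100 categories.
DICTIONARY = {
    "anger": ["furious", "rage", "wrath"],
    "first_person": ["i", "me", "my", "myself"],
    "sense_objects": ["light", "sound", "stone", "water"],
}

def tag_counts(text):
    """Count how many words in `text` fall into each rhetorical category."""
    words = text.lower().split()
    counts = {cat: 0 for cat in DICTIONARY}
    for w in words:
        for cat, patterns in DICTIONARY.items():
            if w in patterns:
                counts[cat] += 1
    return counts, len(words)

counts, total = tag_counts("My rage burned like light on water")
# Proportions, not raw counts, are what make texts of different lengths comparable.
print({cat: n / total for cat, n in counts.items()})
```

Individual taggings will sometimes be wrong, but, as noted above, they are wrong consistently, and averaging over whole texts is what makes the results usable.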

After drinking the DocuScope Kool-Aid, we learned how to visualize the results of DocuScoped data analysis. Again, there were a few other cool features and possibilities, and I only comprehended the tip of the data-analysis iceberg, but basically this involved one of two things.

  • Using something called the MetaData Builder, we derived DocuScope data for individual texts or groups of texts within a large corpus. So, for example, we could find out which of the approximately 500 plays in our subcorpus of dramatic texts is the angriest (i.e., has the greatest proportion of words/phrases DocuScope tags as relating to anger). Or, in an example we discussed at length, we could ask which of the texts in our science subcorpus used more first-person references, Boyle’s or Hobbes’s (i.e., which had the greater proportion of words/phrases DocuScope tags as first-person references). The CS people were quite skilled at slicing, dicing, and graphing all this data in cool combinations. Here are some examples. A more polished essay using this kind of data analysis is here. So this is the distribution of DocuScope traits across texts in large and small corpora.
  • We visualized the distribution of DocuScope tags within a single text using something called VEP Slim TV. Using Slim TV, you can track the rise and fall of each trait within a given text AND (and this is the key part) link directly to the text itself. So, for example, this is an image of Margaret Cavendish’s Blazing-World (1667).

Here, the blue line in the right frame charts lexical patterns that DocuScope tags as “Sense Objects.”

The red line charts lexical patterns that DocuScope tags as “Positive Standards.” You’ll see there is lots of blue (compared to red) at the beginning of Cavendish’s novel (when the Lady is interviewing various Bird-Men and Bear-Men about their scientific experiments), but one stretch in the novel where there is more red than blue (when the Lady is conversing with Immaterial Spirits about the traits of nobility). A really cool thing about Slim TV that could make it useful in the classroom: you can move through and link directly to the text itself (that horizontal yellow bar on the right shows which section of the text is currently being displayed).
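Slim TV’s rise-and-fall curves are, in essence, moving-window tag densities: the share of tagged words in each successive stretch of the text. A toy sketch of that idea (the word list and sample sentence are invented):

```python
# Invented stand-in for DocuScope's "Sense Objects" word list.
SENSE = {"light", "sound", "stone", "water", "colour"}

def trait_curve(text, window=4):
    """Share of 'sense object' words in a sliding window across the text."""
    words = text.lower().split()
    return [sum(w in SENSE for w in words[i:i + window]) / window
            for i in range(len(words) - window + 1)]

sample = "the light and sound of water faded as they spoke of honour and virtue"
# The curve is high where sensory words cluster and falls to zero where they stop.
print(trait_curve(sample))
```

Plot one such curve per tag and you get the colored lines of the Slim TV display.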

So: 1) regularized EEBO-TCP texts are turned into spreadsheets using 2) the DocuScope dictionary; then that data is used to visualize either 3) individual texts as data points within a larger corpus of texts or 4) the distribution of DocuScope tags within a single text.
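Step 3, finding (say) the angriest text in a corpus, amounts to comparing tag proportions across texts. A toy sketch, with an invented word list and a two-text mini-corpus standing in for the 500-play drama subcorpus:

```python
# Invented stand-in for DocuScope's anger-related patterns.
ANGER = {"furious", "rage", "wrath", "fury", "vengeance"}

def anger_share(text):
    """Proportion of words in `text` matching the anger list."""
    words = text.lower().split()
    return sum(w in ANGER for w in words) / len(words)

# Two invented snippets standing in for whole plays.
corpus = {
    "play_a": "calm seas and gentle airs attend the voyage",
    "play_b": "his wrath and fury drive the plot to vengeance",
}
angriest = max(corpus, key=lambda name: anger_share(corpus[name]))
print(angriest)  # the text with the highest anger proportion
```

The same comparison, with "anger" swapped for "first-person references," is the Boyle-versus-Hobbes question from the seminar.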

Again, the seminar leaders showed some nice examples of where this kind of research can lead and lots of cool looking graphs. Ultimately, some of the findings were, if not underwhelming, at least just whelming: we had fun discussing the finding that, relatively speaking, Shakespeare’s comedies tend to use “a” and his tragedies tend to use “the.” Do we want to live in a world where that is interesting? As we experimented with the tools they gave us, at times it felt a little like playing with a Magic 8 Ball: no matter what texts you fed it, DocuScope would give you lots of possible answers, but you just couldn’t tell if the original question was important or figure out if the answers had anything to do with the question. So formulating good research questions remains, to no one’s surprise, the real trick.

A few other key takeaways for me:

1) Learn to love csv files or, better, learn to love someone from the CS world who digs graphing software;

2) Curated data corpora might be the new graduate/honors thesis. Create a corpus (e.g., sermons, epics, travel narratives, court reports, romances), add some good metadata, and you’ve got yourself a lasting contribution to knowledge (again, the examples here are the drama corpora or the science corpora). A few weeks ago, Alan Liu told me that he requires his dissertation advisees to have at least one chapter that gets off the printed page and has some kind of digital component. A curated data collection, which could be spun through DocuScope or any other kind of textual analysis program, could be just that kind of thing.

3) For classroom use, the coolest thing was VEP Slim TV, which tracks the prominence of certain verbal/rhetorical features within a specific text and links directly to the text under consideration. It’s colorful and customizable, something students might find enjoyable.
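On the first takeaway: once tag counts land in a csv file, even the standard library gets you surprisingly far. A sketch, with invented titles, columns, and figures (real DocuScope output has many more columns):

```python
import csv
import io

# An invented csv standing in for a spreadsheet of per-text tag counts.
data = io.StringIO(
    "title,anger,first_person,total_words\n"
    "Play A,120,310,30000\n"
    "Play B,45,280,21000\n"
)

# Read each row and convert raw counts to proportions for comparison.
shares = {}
for row in csv.DictReader(data):
    shares[row["title"]] = int(row["anger"]) / int(row["total_words"])
print(shares)
```

From a dictionary like `shares` it is a short step to the graphs the CS folks were producing.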

All this stuff is publicly available as well. I’d be happy to demo what we did (or what I can do of what we did) to anyone who is interested.

Gregory Kneidel is Associate Professor of English at the Hartford Campus. He specializes in Renaissance poetry and prose, law and literature, and textual editing. He can be reached at gregory.kneidel@uconn.edu.