N-gram fetishism

The intertubes are abuzz over the latest & coolest toy released from Google Labs: the Google Books Ngram Viewer. What is it? And why am I writing two posts about Google technology in a single week???

Google Labs' info page explains: "When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., 'British English', 'English Fiction', 'French') over the selected years." ("How" here means "how frequently.")
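For the curious, here is a rough sketch of what "how frequently" might mean in practice: count how often a phrase turns up in each year's books, then divide by the total number of same-length phrases published that year. This is only my toy illustration, not Google's actual code; the function names and the tiny `toy_corpus` are invented for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Map each year to the share of that year's n-grams that match phrase.

    corpus is a dict of year -> list of book texts (a made-up toy structure,
    assumed here purely for illustration).
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            # Naive whitespace tokenizing; a real tool would do far more.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Hypothetical two-year "corpus" with two very short "books" per year.
toy_corpus = {
    1990: ["the environment is in trouble",
           "catastrophe looms over the environment"],
    2000: ["genetic engineering and the environment",
           "no catastrophe here"],
}

print(yearly_frequency(toy_corpus, "environment"))
print(yearly_frequency(toy_corpus, "genetic engineering"))
```

Plot those per-year shares as a line and you have, more or less, the kind of graph the viewer draws.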

Brenna Ehrlich wrote about it. Nathan Bransford blogged about it. Patricia Cohen wrote about it for the NY Times. It was a topic of e-mail and conversation and conference calls at my office this week. Geoffrey Nunberg, of UC Berkeley's School of Information, and staff reporter Mark Parry wrote about it in the Chronicle of Higher Education. The most far-reaching claims about n-gram analysis of the Google Books corpora may be those in last week's article in Science Magazine, "Quantitative Analysis of Culture Using Millions of Digitized Books." Nunberg's article considers those claims critically.

Okay, so I tried it out for myself.

What, I wondered, can the Google Books Ngram Viewer tell me about my novel manuscript, Consequence, which (FWIW) is being considered for representation even as I type by literary agents across the continent?

Consequence is about activists who are up in arms over looming environmental catastrophe. Here's an elevator speech: San Francisco political activists get in over their heads when peaceful protest collides with gun runners, road rage, and conspiracy to pilot six kilotons of truck bomb into the heart of a midwestern research facility.

So what does the Google Books Ngram Viewer tell me about how these ideas have been represented in English-language books during the period 1930 to 2008? (I'm figuring that period overlaps well with books that folks interested in Consequence are likely to pick up in a bookstore.)

What I see in the graph shown in this post (click the image for a larger view) is that the appearance in books of the words catastrophe, terrorism, and consequence has held pretty steady over a period of 78 years. Books that treated the environment spiked in the late 90s and have been falling through the oughts ("environmental" and "ecology" produce graphs in roughly the same shape over this period; "genetic engineering," which is the sort of environmental catastrophe that the novel's activists focus on, graphs with a flatter tail but rises and falls in parallel with "environment").

And that all means ... what?

That people are always up for a story about terrorism and its consequences? That environmental concerns are passé? That environmental concerns are poised for resurgence? That there used to be a glut of books about the environment but now there's space for new books on the topic to emerge? That catastrophe never goes out of fashion?

When I study the graphs (not very hard, I confess) I come to a simple conclusion. Google's ngram viewer has predicted that some people will want to read Consequence if and when it is published, and others won't bother.

Did I need an app for that?

What I predict is that ngram viewing at the level of crude inquiry enabled by the new Google tool -- nifty and fun as it is to play with -- will prove to be just another technological fetish, probably sooner rather than later.

Geoffrey Nunberg, in his CHE article on Google's Ngram Viewer, links to a May 2010 article in that same journal about Google Books and what it means, or doesn't, for novels, reading, and humanist scholarship: "The Humanities Go Google," by Mark Parry. In it, Professor Katie Trumpener of Yale University (Comp Lit & Film Studies) is quoted describing the kind of analysis made easy by the new ngram viewer as "one that could yield a slew of insignificant numbers with 'jumped-up claims about what they mean.'"


So I think I'll stop here.

What do you think of the ngram viewer? What have you learned from it in the service's first week of operation?

  1. It's fun AND useful--I used it to prove a point to a friend that California is indeed better/more important (or at least written about more) than Ohio (her native state). One problem I do see with the toy/tool is that it doesn't take into account the different definitions and/or contexts of a word. I can't think of a good example right now, but there are billions of words that have different meanings in different fields of study, etc.

  2. @Ariel: I think you just refuted your own claim! Was this your proof?

    If so, did you only prove, perhaps, that California is bigger than Ohio, or has been since the early 20th century? That California has been more populated than Ohio for the time period most books mined by Google's NGram were published? That California serves as a setting or subject for more cultural artifacts (like books and movies and books about movies)?

    Do any of these imply "better" or "more important"? In what contexts?