Monday, December 20, 2010

Reading level in Google search

This isn't my breaking news. Barry Schwartz of searchengineland.com noticed it ten days ago, and the news was slashdotted Thursday. A friend then did me the favor of setting up a link to how One Finger Typing blog posts rate on Google's reading-level scale. Nundu, a Google employee with a very thin profile, announced the new feature on 9 December in the Google Web Search Help Forum.

As you can see from the screenshot, 25% of this blog's posts appear to score as "Basic" and 75% as "Intermediate" reading level.

Alas, the statistical display is misleading. Only some of this blog's posts are classified: four, to be exact, as of mid-December 2010. So "25%" means a single post, and "75%" means three.

The first thing I wonder when I see this sort of claim, involving automated classification, is what methodology was used. That is, how does Google Search measure reading level?

Nundu, in that help forum post, says: "Re: how we classify pages -- The feature is based primarily on statistical models we built with the help of teachers. We paid teachers to classify pages for different reading levels, and then took their classifications to build a statistical model. With this model, we can compare the words on any webpage with the words in the model to classify reading levels. We also use data from Google Scholar, since most of the articles in Scholar are advanced."

Okay, that's pretty cool. Teachers rate texts, professors write texts, then Google's Natural Language Processing magic matches search result content to content of known or rated texts. Those that are similar to "basic" texts are themselves rated "basic," and so forth. There are a fair few known and reliable ways to measure 'similarity' between texts, and while Google doesn't like to tell which it uses I'm fairly confident that they know what they're doing -- what with all the content at their disposal, and their strong financial interest in getting 'similarity' right in order to successfully present search results.
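To make the "similar to rated texts" idea concrete, here is a toy sketch of one well-known similarity measure, cosine similarity over word counts. To be clear, this is not Google's method, which isn't public; the reference snippets and the names `REFERENCE_TEXTS`, `word_counts`, and `classify` are invented for illustration, and a real system would train on many teacher-rated pages rather than one sentence per level.

```python
import math
import re
from collections import Counter

# Invented stand-ins for teacher-rated pages (assumption, not real data).
REFERENCE_TEXTS = {
    "basic": "the cat sat on the mat and the dog ran to the park",
    "advanced": ("the regressive palatalization of velar consonants "
                 "before front vowels in the dialect"),
}

def word_counts(text):
    """Lowercase the text and count each word (a bag-of-words model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(text, references=REFERENCE_TEXTS):
    """Return the label whose reference text is most similar to `text`."""
    counts = word_counts(text)
    return max(
        references,
        key=lambda label: cosine_similarity(counts, word_counts(references[label])),
    )

print(classify("the dog sat on the mat"))
print(classify("velar consonants before front vowels"))
```

The first call shares most of its words with the "basic" snippet and the second shares words only with the "advanced" one, so each lands in the matching bucket. Whatever Google actually runs is surely far more refined than this.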

Reading level analysis has been a part of Google Docs for a while. If you've got a document in Google Docs, and use the Word Count tool, numeric ratings are given on three different scales of "readability": Flesch Reading Ease, Flesch-Kincaid Grade Level, and the Automated Readability Index.

For example, one of the search-classified posts on One Finger Typing -- Nominative determinism in fiction -- is given a Flesch Reading Ease score of 61.13 by Google Docs, indicating it's "easily understandable by 13- to 15-year-old students." That's about on a par with Reader's Digest, but easier to digest than a Time magazine article, or so says Wikipedia. Google Search classifies this post as "Basic."
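For the curious, all three scales are simple arithmetic over counts of sentences, words, syllables, and characters. Below is a minimal sketch using the standard published formulas. The sentence splitter and the syllable counter are crude heuristics of my own (assumptions, not whatever Google Docs does internally), so don't expect it to reproduce the 61.13 above exactly.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, minus a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text):
    """Return Flesch Reading Ease, Flesch-Kincaid Grade Level, and ARI."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    # Count letters only, so apostrophes don't inflate word length.
    chars_per_word = sum(sum(c.isalpha() for c in w) for w in words) / len(words)

    return {
        "flesch_reading_ease":
            206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade":
            0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "automated_readability_index":
            4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43,
    }

scores = readability("The cat sat on the mat. The dog ran.")
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Higher Flesch Reading Ease means easier text, while the other two scales approximate a U.S. school grade, so very short, simple sentences like the sample can score above 100 on the first and below zero on the others.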

I'm curious to know the full distribution of classifications across my blog posts to date, but I guess Google's servers have to chew over the intertubes a bit more before that's made as easy as a few keystrokes. I'm not so eager to stare at my own blog's navel that I'm going to iterate through every one of my posts with the Word Count feature in Google Docs.

In the meantime, it's good to know that I'm not pitching prose exclusively to university professors. One hopes to be more broadly accessible than that, no offense intended to any university professor who has ever read or commented on One Finger Typing, or ever might.


Related posts on One Finger Typing:
Google's new Blogger interface
Google yanks APIs, developers caught with pants around ankles
N-gram fetishism

4 comments:

  1. It's not without its wonkiness-- the one Crescat Graffiti page that's ranked "advanced" has practically no text on it. On the upside, I'm relieved to see that my BA paper is advanced.

  2. Dude, my score is totally flipped on yours. I'm, like, 73% basic, an' only 26% intermediate.

  3. @Glenn -- So you might say that you're, basically, a flippin' Google genius? Bring on the readers....

  4. @Quinn -- (sorry, you got caught in blogspot's spam filter for a day or so there...) --

    Let's face it. If Google applied less than an advanced rating to something that begins like this:

    The velar plus front vowel combinations seen in such examples as kělě and xěrъ, which do not seem to show the effect of the second regressive palatalization, are one of the most distinctive features of the dialect of Old East Slavic spoken in Novgorod.

    ... well, they'd have to pull the plug on the reading levels feature.

    If I ask what a "second regressive palatalization" is and you answer, will I understand?
