Friday, August 04, 2006

A new news research tool

Researchers at UC Irvine have developed a way to use text mining, specifically a technique called 'topic modeling', to derive relevant information from the archives of articles in the New York Times.

As news archives have matured since their beginnings in the 1970s and '80s, they've become huge databases. The Times may be the biggest, having been online since 1980, rivaled by the Washington Post, online in full text since 1977, and some of the former Knight Ridder newspapers: Philadelphia (1979), Detroit (1980) and the Miami Herald (1982). (I may be a year off with a couple of these, but you get the drift.) If you really need to compile years' worth of incidents from these databases, in Nexis or another service, it can take hours of frustrating work weeding out the story hits that aren't relevant.

This study used only two years of the Times' database, but analyzed 330,000 stories in 'just a few hours'. Hmm, is that fast enough?

The Resourceshelf posting where I found this has links to the study and the researchers. Here's the nut graf, from the press release, for news researchers:
Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.

(Added later:) More on data mining, from Depth Reporting: Ars Technica reports on a project to mine the entire Congressional Record archive. And Google is releasing a database of a trillion words used on websites, for researchers to use. It'll fill six DVDs. Wow.
