Charles Petzold



Google Books Needs to Know What They Have

September 10, 2007
New York, N.Y.

I don't know if anybody remembers this anymore, but back in the mid-to-late 1990s there was actually a debate about the relative utility of web directories and web search engines. Yahoo, for example, offered a human-edited directory that separated the web's contents into categories and sub-categories, much like the systems used to categorize books in a library. Early web search engines tended to be less useful because they could easily be tricked. It wasn't until search engines became much more sophisticated that web directories began seeming unnecessarily limited in scope. That sophistication is Google's claim to fame.

Web search is now so sophisticated and so dominant that the very idea of revisiting the directory concept seems perverse. Yet a human-edited directory may be what is required to transform Google Books into an actually usable online library.

As an example, let's take philosopher John Stuart Mill's 1843 book A System of Logic, which was a highly influential book in the 19th century but isn't widely available these days. (Although Mill's book was published just four years before George Boole's pamphlet The Mathematical Analysis of Logic, the books are very different: Mill's book concerns epistemology and scientific induction.)

When I say that A System of Logic was published in 1843, I mean of course that the first edition was published that year. Like many popular books, A System of Logic was subsequently published in later editions, sometimes with corrections or enhancements. If you wanted to know Mill's final thoughts on the subject, you might consult the last edition he was involved in preparing. If you wanted to get a better feel for the book's early impact on 19th-century intellectual thought, consulting the first edition would be appropriate. If you were researching the life of someone who read the book, you might want to track down the particular edition your subject read.

This gets a little messier because A System of Logic was first published in London but was later published in New York, so you need to know where an edition was published as well as when.

A System of Logic was a rather lengthy book, so many of the editions were published in two volumes. The two volumes are not considered to be different books, but parts of the same book.

If you have any experience using Google Books, you're probably already cringing, but let's be brave. Go to the Advanced Book Search page, and put "System of Logic" into the Title field and "John Stuart Mill" into the Author field. Or try this:

http://books.google.com/books?q=intitle:"system of logic" inauthor:"john stuart mill"
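That URL contains spaces and quotation marks, which a browser quietly encodes when you paste it in. If you wanted to generate such searches from a program, a minimal sketch might look like the following. The function name is my own invention, and the only thing assumed about Google Books is the URL shape shown above.

```python
from urllib.parse import urlencode

def book_search_url(title: str, author: str) -> str:
    """Build a search URL of the same shape as the one shown above.

    The intitle:/inauthor: operators and the base address are copied from
    that URL; the function name itself is hypothetical.
    """
    query = f'intitle:"{title}" inauthor:"{author}"'
    # urlencode percent-encodes the colons and quotes and turns spaces into '+'.
    return "http://books.google.com/books?" + urlencode({"q": query})

print(book_search_url("system of logic", "john stuart mill"))
```

The printed address is the encoded form of the same search.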

I get three hits:

Notice that the full text of the second item isn't available because it's a recent publication. But it's nice to know that a modern scholarly edition exists. The third item is one of the benefits of search: We've turned up a related book of potential interest by William Whewell. (Listing Mill as a co-author of Whewell's book is an error, although the book wouldn't have turned up in this search without it!) But to get at the myriad editions of Mill's book you must click "More editions" under the first item.

And then you get 22 items; some of them are recent reprints but many of them are old. Fortunately, many dates are present in this list: 1858, 1869, 1843, 1862, 1865, 1868, and so forth, but in no particular order. To actually determine what editions these are — and whether the edition was published in London or New York — you need to select each item and then look at the title page.

For example, the item dated 1862 is the 5th London edition. But it's only Volume One! Where on earth is Volume Two? More digging is required.

Wouldn't it be nice if searching for this book led you to a page with a simple list of the various editions, where they were published and when, with links to Volume One and Volume Two of each edition? Isn't that really the minimum we should expect of an online library?
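To make that wish concrete, here is a rough sketch of how such a list could be produced once each scan carries trustworthy metadata. Everything in it is an assumption for illustration: the Scan record, its field names, and the sample entries with their placeholder URLs are mine, not anything Google Books actually exposes.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    """One scanned volume. All field names here are hypothetical."""
    year: int
    place: str
    edition: str
    volume: int
    url: str

def edition_list(scans):
    """Group scans by edition and return one display line per edition."""
    editions = defaultdict(dict)
    for scan in scans:
        key = (scan.year, scan.place, scan.edition)
        # Keep the first scan seen for each volume; later duplicates are ignored.
        editions[key].setdefault(scan.volume, scan.url)
    lines = []
    for (year, place, edition), volumes in sorted(editions.items()):
        links = ", ".join(f"Vol. {n}: {url}" for n, url in sorted(volumes.items()))
        lines.append(f"{year}, {place}, {edition} edition ({links})")
    return lines

# Placeholder records for illustration only; the URLs are invented.
sample = [
    Scan(1862, "London", "5th", 2, "https://example.invalid/b"),
    Scan(1862, "London", "5th", 1, "https://example.invalid/a"),
    Scan(1843, "London", "1st", 1, "https://example.invalid/c"),
]

for line in edition_list(sample):
    print(line)
```

The grouping itself is trivial. The hard part is filling in the year, place, edition, and volume fields correctly in the first place.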

But wait! Someone's already done some of the work. In the Wikipedia entry for A System of Logic, one of the indefatigable Wikipedians has attempted to categorize Google Books' holdings:

Google Books is revealing the limitations of search. The search discovers 22 books that match the title "System of Logic" and the author "John Stuart Mill" and then it just lumps them into a group, seemingly haphazardly. Google Books needs to know that there is an entity independent of this search called A System of Logic by John Stuart Mill, and that this book has a publication history revealed by the myriad copies of a book with that title.

Google Books needs to take an extra step. When two or more scanned books have the same title and author, an examination must occur: Are these entirely different books that coincidentally share title and author? Are they copies of the same book? Are they different editions of the same book? Are they the same edition but different printings? Are they complementary volumes of the same edition?
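Those questions amount to a small decision procedure. The sketch below is one way to write it down, assuming each scan has already been given a record with a work identifier, an edition, a printing year, and a volume number. The Record type, its fields, and the order in which the questions are asked are all my own assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Relation(Enum):
    DIFFERENT_WORKS = "different books that share a title and author"
    DIFFERENT_EDITIONS = "different editions of the same book"
    COMPLEMENTARY_VOLUMES = "complementary volumes of the same edition"
    DIFFERENT_PRINTINGS = "same edition, different printings"
    DUPLICATE_COPIES = "copies of the same book"

@dataclass(frozen=True)
class Record:
    """Hypothetical catalog record for one scanned volume."""
    work_id: str        # identifies the abstract work, assigned by a cataloger
    edition: str
    printing_year: int
    volume: int

def classify(a: Record, b: Record) -> Relation:
    """Answer the questions above for a pair of scans, coarsest first."""
    if a.work_id != b.work_id:
        return Relation.DIFFERENT_WORKS
    if a.edition != b.edition:
        return Relation.DIFFERENT_EDITIONS
    if a.volume != b.volume:
        return Relation.COMPLEMENTARY_VOLUMES
    if a.printing_year != b.printing_year:
        return Relation.DIFFERENT_PRINTINGS
    return Relation.DUPLICATE_COPIES

# Illustrative records only; the identifiers are made up.
vol_one = Record("mill-system-of-logic", "5th London", 1862, 1)
vol_two = Record("mill-system-of-logic", "5th London", 1862, 2)
print(classify(vol_one, vol_two).value)
```

Note that the code does none of the real work: deciding which work_id and edition a given scan belongs to is exactly the examination that has to happen first.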

I'm sure much of this process can be automated, but sometimes an actual human being (preferably a Library Science major) will have to look at the books and make a decision.