Friday, September 24, 2010

Google Books and the Importance of Good Metadata

(It's Friday, and so it's time for me to wave the librarian/metadata flag once more. I wonder what a flag like that would look like? MARC fields? If you have a thought about that, let me know.)

If you keep track of any metadata-related librarian blogs, I feel sure that you've seen Laura Miller's article on Salon, "The Trouble with Google Books." The Google Books project has been fraught with problems since its inception, chiefly over the licensing rights for scanned works still under copyright. We've all heard about this, and I think that the issues surrounding rights resolutions will continue to be dealt with for years to come. What appealed to me about this article was that it takes one of the fundamental problems associated with Google Books - metadata and searching - and explains it in plain English. Librarians, and especially those of us who catalog, tend to use a great many acronyms and a lot of jargon in speaking to one another (MARC21, XML, DCMI, LCCS, DDC, LCSH, Authorities, Main Entry, Chief Source of Information, etc.), so it's nice to see the problems highlighted and well illustrated in the article.


Rather than rehashing what Ms. Miller states in the article, allow me to make two points prompted by it.

First, full-text searching is not the equivalent of searching good metadata. Let me highlight this with a quote from the article:

Nunberg, a linguist interested in how word usage changes over time, noticed "endemic" errors in Google Books, especially when it comes to publication dates. A search for books published before 1950 and containing the word "Internet" turned up the unlikely bounty of 527 results. Woody Allen is mentioned in 325 books ostensibly published before he was born.

Other errors include misattributed authors -- Sigmund Freud is listed as a co-author of a book on the Mosaic Web browser and Henry James is credited with writing "Madame Bovary." Even more puzzling are the many subject misclassifications: an edition of "Moby Dick" categorized under "Computers," and "Jane Eyre" as "Antiques and Collectibles" ("Madame Bovary" got that label, too).


Intelligently crafted and applied metadata would have helped this situation. Simply because a book mentions cattle a certain number of times does not mean the item is about cattle; indeed, it might be about something else entirely. Also, full-text searching cannot discriminate between differing editions of an item - something very useful in scholarly research. I think Google sees books (on some level) as websites, and assumes that its (relatively) effective method for indexing websites could be applied to books. This is simply not the case, as is clearly demonstrated in the article. Indeed, I think many people assume that a library catalog and Google's search algorithms do close to the same thing.
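To make the cattle point concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the `Record` fields, the three sample records, and the two search functions are my own assumptions, not how Google Books or any real catalog works. It only shows that a word search over the text and a search over assigned subject headings can return very different results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical, heavily simplified catalog record."""
    title: str
    edition: str
    year: int
    subjects: tuple
    full_text: str


# Invented sample records, for illustration only.
CATALOG = [
    Record("Raising Beef Cattle", "1st ed.", 1998,
           ("Cattle", "Animal husbandry"),
           "Feeding and housing a herd through the winter."),
    Record("A Frontier Memoir", "1st ed.", 1952,
           ("Frontier and pioneer life",),
           "We passed cattle on the road, then more cattle, and cattle again."),
    Record("A Frontier Memoir", "2nd ed., rev.", 1960,
           ("Frontier and pioneer life",),
           "We passed cattle on the road, then more cattle, and cattle again."),
]


def full_text_search(records, word):
    """Return records whose text contains the word, whatever the book is about."""
    word = word.lower()
    return [r for r in records if word in r.full_text.lower()]


def subject_search(records, subject):
    """Return records that a cataloger assigned this exact subject heading."""
    subject = subject.lower()
    return [r for r in records if any(s.lower() == subject for s in r.subjects)]


if __name__ == "__main__":
    for r in full_text_search(CATALOG, "cattle"):
        print("full text:", r.title, r.edition)
    for r in subject_search(CATALOG, "Cattle"):
        print("subject:  ", r.title, r.edition)
```

In this toy data the full-text search finds the memoir twice and misses the one book that is actually about cattle, while the subject search finds only that book. The two memoir records also have identical text, so only the `edition` and `year` fields tell them apart.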

Second, as I have said before, this article highlights the need for good catalogers in libraries of all types. Just because a record already exists doesn't mean the cataloging is good - I run into this all the time while copy cataloging at the Carter. Again, this is highlighted in a quote from the article:

Several people at Google took pains to respond to my original blog posting about this issue, and they claim that many of these errors originated with the providers (libraries or commercial services hired to provide metadata about books), not Google. It's true that no metadata source is perfect. The Harvard Library makes mistakes, too. But nothing on the scale I found in Google Books. The Harvard Library does not have Henry James as the author of "Madame Bovary."


Simply because someone else has already "done" the job doesn't mean they did it well, and the party that created the metadata may simply be unfamiliar with your own metadata needs. Of course, that second concern (how specific metadata should be to a collection) is less of an issue with a one-for-all collection like Google Books. However, there should still be consistent practices for creating metadata because, as I have said before, consistent metadata is the key to improving access for your users: consistent searching and consistent results.
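As a rough illustration of what "consistent practices" can mean in practice, here is a small Python sketch that checks incoming records against a few local rules before they are accepted. The rules, the field names, the year range, and the tiny `AUTHORIZED_SUBJECTS` list are all assumptions made up for this example; a real workflow would check against actual authority files and local cataloging policy.

```python
import re

# Hypothetical controlled vocabulary; a real one would come from an authority file.
AUTHORIZED_SUBJECTS = {"Cattle", "Whaling", "Governesses"}

# Assumed local rule: personal names are entered as "Surname, Forename".
AUTHOR_FORM = re.compile(r"^[^,]+, [^,]+$")


def check_record(record, current_year=2010):
    """Return a list of problems found in one record (an empty list means it passed)."""
    problems = []

    year = record.get("year")
    # 1450 is an arbitrary lower bound chosen for this sketch.
    if not isinstance(year, int) or not (1450 <= year <= current_year):
        problems.append(f"implausible year: {year!r}")

    author = record.get("author", "")
    if not AUTHOR_FORM.match(author):
        problems.append(f"author not in 'Surname, Forename' form: {author!r}")

    for subject in record.get("subjects", []):
        if subject not in AUTHORIZED_SUBJECTS:
            problems.append(f"unauthorized subject heading: {subject!r}")

    return problems


if __name__ == "__main__":
    good = {"author": "Melville, Herman", "year": 1851, "subjects": ["Whaling"]}
    bad = {"author": "Herman Melville", "year": 18510, "subjects": ["Computers"]}
    print(check_record(good))
    for problem in check_record(bad):
        print(problem)
```

A check like this will not catch every error (it cannot know that Henry James did not write "Madame Bovary"), but it does flag records that break the rules you have agreed on, whoever created them.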

What issues do you have with the Google Books metadata, or in general? What did Google do well that libraries should emulate? Let me know, and have a great weekend!
