What Happened to Google's Effort to Scan Millions of University Library Books?

It was a crazy idea: Take the bulk of the world’s books, scan them, and create a monumental digital library for all to access. That’s what Google dreamed of doing when it embarked on its ambitious book-digitizing project in 2002. It got part of the way there, digitizing at least 25 million books from major university libraries.

But the promised library of everything hasn’t come into being. An epic legal battle between authors and publishers and the internet giant over alleged copyright violations dragged on for years. A settlement that would have created a Book Rights Registry and made it possible to access the Google Books corpus through public-library terminals died when a federal judge rejected it in 2011. And though the same judge ultimately dismissed the case in 2013, handing Google a victory that allowed it to keep on scanning, the dream of easy and full access to all those works remains just that.

Earlier this year, an article in The Atlantic lamented the dismantling of what it called “the greatest humanistic project of our time.” The author, a programmer named James Somers, put it like this: “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

That assessment may be technically true, but many librarians and scholars see the legacy of the project differently. In fact, academics now regularly tap into the reservoir of digitized material that Google helped create, using it as a dataset they can query, even if they can’t consume full texts. It’s a pillar of the humanities’ growing engagement with Big Data.

It’s also a handy resource for other kinds of research. “It’s hard to imagine going through a day doing the work we academics do without touching something that wouldn’t be there without Google Book Search,” says Paul Courant, now interim provost and executive vice president for academic affairs at the University of Michigan. Courant was also interim provost at Michigan when Google first approached the university about scanning the contents of its library—a proposal that left him both “ecstatic and skeptical,” he says.

“I’m not a fan of everything Google, by any means,” Courant says now. “But I think this was an amazing effort which has had lasting consequences, most of them positive.”

Google’s scanning project helped establish some important nodes in what’s become an ever-expanding web of networked research. As part of the deal, Google’s partner libraries made sure they got to keep digital copies of their scanned works for research and preservation use. That material helped stock a partnership called the HathiTrust Digital Library. Established in 2008 and based at the University of Michigan, it has grown to include 128 member institutions, according to its executive director, Mike Furlough. It now contains more than 15.7 million volumes. Taking into account multi-volume journals and duplicate copies, that’s about 8 million unique items, about 95 percent of them from Google’s scanning. The rest come from the Internet Archive’s ongoing scanning work and local digitization efforts, according to Furlough.

That rich resource has been put to several good uses. Through the HathiTrust Research Center, scholars can tap into the Google Books corpus and conduct computational analysis—looking for patterns in large amounts of text, for instance—without breaching copyright. And print-disabled users can use assistive technologies to read scanned books that might otherwise be difficult if not impossible to find in accessible formats.

Courant and others involved in the early days of the scanning work acknowledge both the benefits and the shortfalls. “That the universal bookstore-cum-library failed is, to me, a sadness,” he says. And while Google vastly improved its scanning technology as the project went along, it wasn’t ultimately able to resolve a persistent cultural challenge: how to balance copyright and fair use and keep everybody—authors, publishers, scholars, librarians—satisfied. That work still lies ahead.

In spite of the legal wrangling and the failure of the settlement, Mary Sue Coleman considers the project a net gain. Coleman, the current president of the Association of American Universities, was the president of the University of Michigan in the early 2000s when Google co-founder Larry Page, a Michigan alum, approached his alma mater with the scanning idea. Many of the university’s holdings “were invisible to the world,” Coleman says. Google’s involvement promised to change that.

Without Google’s backing and technological abilities, a resource like HathiTrust would have been much harder to create, she says. “We couldn’t have done it without Google,” Coleman says. “The fact that Google did it made things happen much more rapidly, I believe, than it would have happened if universities had been doing it without a central driving force.”

Transforming Scholarship

Ted Underwood’s work is one of the more prominent examples of the kind of scholarship born of Google’s scanning push. Underwood, a professor and LAS Centennial Scholar of English and a professor in the School of Information Sciences at the University of Illinois (and a leading figure in the digital humanities world), describes the effect of Google Books on his scholarship as “totally transformative.” The resources made available by HathiTrust, even those still under copyright, have expanded what he can do and the questions he asks.

“I used to work entirely on the British Romantic period,” Underwood said via email. “Now I spend much of my time studying history broadly across the last two centuries, and the reason is basically Google Books.”

The HathiTrust Research Center allows Underwood and others to work with copyrighted materials. “I can’t physically get the texts under copyright, or distribute them, but I can work inside a secure Data Capsule and measure the things I need to measure to do research,” he says. “So it’s not like my projects have to come to a screeching halt in 1923.” (That’s the year that marks the Great Divide between materials that have come into the public domain and those still locked out of it.)

A Data Capsule is a secure, virtual computer that allows what’s known as “non-consumptive” research, meaning that a scholar can do computational analysis of texts without downloading or reading them. The process respects copyright while enabling work based on copyrighted materials.

For Underwood, that’s made it possible to take on projects like a collaborative study on the gender balance of fiction between 1800 and 2007, conducted with David Bamman of the University of California at Berkeley and Sabrina Lee, also at the University of Illinois. Underwood described the thrust of the work in a blog post last year.

“The headline result we found is that women were equally represented as writers of English-language fiction in the 19th century, and lost ground dramatically in the 20th,” he says. The male-to-female ratio rose from 1:1 around 1850 to about 3:1 a hundred years later.

“Quite a dramatic change, and in the wrong direction, which seemed so counterintuitive that we didn’t initially believe the results we were getting from HathiTrust,” Underwood says. But a cross-check with Publishers Weekly confirmed the downward slide, which turned around circa 1970, for reasons Underwood and his co-investigators are exploring.

The Networked Library

Google Books and the HathiTrust can also be seen as “signature examples” of how research libraries have evolved beyond thinking of themselves as separate warehouses of knowledge, says Dan Cohen, the recently appointed university librarian at Northeastern University. He’s also vice provost for information collaboration and a professor of history there. Until recently, he was executive director of the Digital Public Library of America, or DPLA.

For those charged with running academic libraries, “there’s really going to be a long-term impact of decentering the library as a stand-alone institution,” Cohen says. That shift corresponds with how researchers operate now. “They’re not expecting to get everything from their home institution,” he says. “They’re expecting that resources will be collectively held and available on the net.”

This expanding digital reality makes it even more important to look critically at the results of Google’s scanning work. Roger C. Schonfeld, director of the libraries and scholarly communication program at the nonprofit Ithaka S&R, is working on a book with Deanna Marcum, former Ithaka S&R managing director and now a senior adviser there, about the Google Books project.

“The question we’re really trying to raise is why did so much of the digitization happen this way, and what other ways could it have happened?” Schonfeld says. Google’s technological and financial muscle sped up the digitizing process enormously, but the company’s priorities weren’t necessarily those of its library partners.

Schonfeld makes the point that as researchers tap into Google Books, it’s essential to ask what selection biases might lurk in the material the project made available. “As anyone doing historical research knows, you can’t ever have all the sources you could possibly wish to have,” Schonfeld says.

To fully judge the value of what came out of Google Books, researchers and librarians need to critically examine what’s been scanned, and from which collections. Not all libraries were included in Google’s project, and no library has everything. “What’s present and what’s absent?” Schonfeld asks. “What are the biases inherent in the creation and selection of the collection?”

Such questions suggest that, on some level, a universal library was always an impossible dream. But Google Books did produce substantial results, even if they are imperfect and incomplete. (One popular tool is the Ngram Viewer, which allows a user to search Google Books data for occurrences over time of specific words.)

Google, for its part, doesn’t say much publicly about the scanning project these days, though the work continues.

“For more than ten years, Google has been committed to increasing the reach of the knowledge and art contained in books by making them discoverable and accessible from a simple query,” Satyajeet Salgar, product manager for Google Books, said via email. “We are continuing to digitize and add books to this world-changing index, improving the quality of our image-processing algorithms and the effectiveness of search, and plan to carry on doing so for years to come. We are proud to continuously make it easier for people to find books to read and conduct deep research using this product.”

More digitized content is good. But it may fall on universities and libraries to figure out how to carry forward the campaign to make that content most usable.

As Paul Courant points out, “the big problem is not further digitization” but access. HathiTrust prevailed in a separate fair-use lawsuit brought by authors and publishers. But too much remains locked up, Courant says, and the problem of orphan works—those whose copyright status is murky—is yet to be solved.

For Mike Furlough, HathiTrust’s executive director, it’s up to the library community to figure out where to go with what Google helped start. He points to an evolving national digital infrastructure, funded in part by entities like the federal Institute of Museum and Library Services as well as private groups like the Andrew W. Mellon Foundation and the Sloan Foundation.

By pushing digitization, Google Books has helped print collections, too. Under HathiTrust’s Shared Print program, some library members of the consortium agree to hold onto a print copy of each digitized monograph. “We’re not saying that digital is enough,” Furlough says. “We’re saying that digital is a complement. We don’t think print ever goes away.”

Google’s scanning work “has been an incredible boost,” Furlough says. “What remains is to figure out what remains. It does not get us all the way to the end.”