Research Article: Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web

Date Published: October 31, 2008

Publisher: Public Library of Science

Author(s): Duncan Hull, Steve R. Pettifer, Douglas B. Kell, Johanna McEntyre

Abstract: Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as “thought in cold storage,” and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.

Partial Text: The term digital library [2]–[4] denotes a collection of literature and its attendant metadata (data about data) stored electronically. According to Herbert Samuel, a library is “thought in cold storage” [5], and unfortunately digital libraries can be cold, isolated, impersonal places that are inaccessible to both machines and people. Many scientists now organize their knowledge of the literature using some kind of computerized reference management system (BibTeX, EndNote, Reference Manager, RefWorks, etc.), and store their own digital libraries of full publications as PDF files. However, getting hold of both the data (the actual publication) and the metadata for any given publication can be problematic because they are often frozen in the isolated and icy deposits of scientific publishing. Because each library and publisher has different ways of identifying and describing their metadata, using digital libraries (either manually or automatically) is much more complicated than it needs to be [6]. Moreover, with papers in the life sciences alone (at Medline) being published at the rate of approximately two per minute [7], only computerized analyses can hope to be reasonably comprehensive. What, then, are these digital libraries, and what services do they provide?

Because computational biology is an interdisciplinary science, it draws on many different sources of data, information, and knowledge. Consequently, there exists a range of digital libraries on the Web identified by URIs [25] and/or DOIs [55],[56] that a typical user requires, each with its own speciality, classification, and culture, from computer science through to biomedical science. DOIs are a specific type of URI and similar to the International Standard Book Numbers (ISBN), allowing persistent and unique identification of a publication (or indeed part of a publication), independently of its location. The range of libraries currently available on the Web is described below, starting with those that focus on specific disciplines (such as ACM, IEEE, and PubMed) through to libraries covering a broader range of scientific disciplines, such as ISI WOK and Google Scholar. For each library, we describe the size, coverage, and style of metadata used (summarized in Table 1 and Figure 2). Where available, DOIs can be used to retrieve metadata for a given publication using a DOI resolver such as CrossRef [57], a linking system developed by a consortium of publishers. We illustrate with specific examples how URIs and DOIs are used by each library to identify, name, and locate resources, particularly individual publications and their author(s). We often take URIs for granted, but these humble strings are fundamental to the way the Web works [58] and how libraries can exploit it, so they are a crucial part of the cyberinfrastructure [59] required for e-science on the Web. It is easy to underestimate the value of simple URIs, which can be cited in publications, bookmarked, cut-and-pasted, e-mailed, posted in blogs, added to Web pages and wikis [60]–[62], and indexed by search engines. Simple URIs are a key part of the current Web (version 1.0) and one of the reasons for the Web’s phenomenal success since appearing in 1990 [63]. 
As we shall demonstrate with examples, each digital library has its own style of URI for being linked to (inbound links) and alternative styles of URI for linking out (outbound links) to publisher sites. Some of these links are simple, others more complex, and this has important consequences for both human and programmatic access to the resources these URIs identify.
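
To make the role of DOIs and resolvers concrete, the following is a minimal sketch of how a bare DOI might be turned into a resolvable URI and into a metadata request. It is an illustration under stated assumptions, not a description of how any particular library works: the resolver address https://doi.org/, the use of an HTTP Accept header to ask for BibTeX, the function names, and the DOI 10.1234/example.5678 are all placeholders introduced here and do not come from the article. The request is only constructed, never sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a public DOI resolver reachable at this address.
DOI_RESOLVER = "https://doi.org/"

# Prefixes people commonly paste in front of a bare DOI.
_KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI (e.g. '10.1234/abc') from a DOI, 'doi:' string, or resolver URI."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with '10.' and has a '/' separating prefix from suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def doi_to_uri(raw: str) -> str:
    """Build a location-independent URI for the publication identified by the DOI."""
    return DOI_RESOLVER + quote(normalize_doi(raw), safe="/")


def metadata_request(raw: str, media_type: str = "application/x-bibtex") -> Request:
    """Prepare (but do not send) a request asking the resolver for metadata.

    Assumption: the resolver honours HTTP content negotiation for this media type.
    """
    return Request(doi_to_uri(raw), headers={"Accept": media_type})


# Hypothetical DOI used purely for illustration.
example = metadata_request("doi:10.1234/example.5678")
```

The point of the sketch is that one short string yields both a human-usable link and a machine-usable metadata request. Library-specific URI styles often need separate handling for inbound and outbound links.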

The digital libraries outlined in the previous section all differ in their coverage, access, and features, but the abstract process of using them is more standard. Figure 4 shows an abstract workflow for using any given digital library. We do not propose this as a universal model that every user will follow, but provide it to illustrate some of the problems with managing data and metadata in those libraries.

Although libraries can be cold, the tools described in this section could potentially make them much warmer. They do this in two main ways. Personalization allows users to say this is my library, the sources I am interested in, my collection of references, as well as literature I have authored or co-authored. Socialization allows users to share their personal collections and see who else is reading the same publications, including added information such as related papers with the same keyword (or “tag”) and what notes other people have written about a given publication. The ability to share data and metadata in this way is becoming increasingly important as more and more science is done by larger and more distributed teams [142] rather than by individuals. Such social bookmarking is already available on the Web sites of publications such as the Proceedings of the National Academy of Sciences and the journals published by Oxford University Press.
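
The two socialization features mentioned above (finding related papers that share a tag, and seeing who else is reading a publication) can be sketched with a small in-memory index. This is only an illustrative sketch and not the data model of any tool named in this article. The tuple layout, the function names, the users, and the identifiers such as pub:1 are all hypothetical.

```python
from collections import defaultdict

# Hypothetical bookmarks: (user, publication id, tags chosen by that user).
bookmarks = [
    ("alice", "pub:1", ["text-mining", "Ontology"]),
    ("bob", "pub:1", ["ontology"]),
    ("bob", "pub:2", ["ontology", "workflow"]),
    ("carol", "pub:3", ["workflow"]),
]


def build_tag_index(entries):
    """Map each lower-cased tag to the set of publication ids bookmarked with it."""
    index = defaultdict(set)
    for _user, pub_id, tags in entries:
        for tag in tags:
            index[tag.lower()].add(pub_id)
    return index


def related_by_tag(index, tag, exclude=None):
    """List publications sharing a tag, optionally leaving out the one being viewed."""
    return sorted(index.get(tag.lower(), set()) - {exclude})


def readers_of(entries, pub_id):
    """List the users who have bookmarked a given publication."""
    return sorted({user for user, pid, _tags in entries if pid == pub_id})


tag_index = build_tag_index(bookmarks)
related = related_by_tag(tag_index, "ontology", exclude="pub:1")
readers = readers_of(bookmarks, "pub:1")
```

With the sample data, `related` contains only pub:2 and `readers` contains alice and bob. Tags are lower-cased here so that "Ontology" and "ontology" match. A real service would need a more careful policy for reconciling free-text tags.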

The software tools described in the section “Some Tools for Defrosting Libraries” are a promising start to improving the digital library. They make data and metadata more integrated, personal, and sometimes more sociable. However, they face considerable obstacles to further success.

Warmer digital libraries cannot be achieved by software tools alone. The digital libraries themselves can take simple steps to make data and metadata more amenable to human and automated use, making their content more useful and useable. Only with proper and better access to linked data and metadata can the tools that computational biologists require be built. We make the following recommendations to achieve this goal.

The future of digital libraries and the scientific publications they contain is uncertain. Rumours of the death of printed books [200] and the death of the journal [201] have (so far) been greatly exaggerated. In scientific publishing, we are beginning to see books and electronic journals becoming more integrated with databases, blogs, and other digital media on the Web. These and other changes could lead to a resurgence in the role of nonprofit professional societies and institutional libraries in the scientific enterprise [104] as the cost of publishing falls. But the outcome is still far from certain.