A library must have a collection of materials that carry information. In addition to its collection, a library must have some kind of organizational rules and some kind of finding mechanisms, collectively known as its "technologies" (e.g., catalogs, search engines). Finally, every library serves one or more identifiable communities of users. A library becomes digital as the collection, the technologies, and the relation to the users are converted from printed formats (i.e., books and paper) to electronic formats. First, the collection itself must be made machine readable. Next, the technologies must be converted to computer-based forms. Finally, an interface to members of the user community must be provided in computer formats.
A digital library may be as small as the set of files on one person's computer, organized into a hierarchical directory structure and supplemented by the owner's personal scheme for naming files and directories. In such a system, the meaning of the path name "c:/documents/personal/smith.david/to/2000April24" is clear to the owner of the system. However, if the owner wished to locate the letter in which "I wrote to David Smith about Aunt Martha," this naming scheme might not be adequate. If the system were used by many people, the problem would be complicated further by the possibility of there being more than one David Smith. The problem of indexing or organizing by content exists for every digital library, from the smallest to the largest. The key technology for solving this problem is called "information retrieval."
The purpose of a library, viewed in the broadest sense, is to facilitate communication across space and time by selecting, preserving, organizing, and making accessible documents of all kinds. Digital libraries provide many opportunities to improve upon paper libraries. For example, methods of information retrieval make it possible to index books at the level of chapters, or even at the level of sections and paragraphs. However, just as a paper library can provide too many books, digital libraries can provide an even greater over-abundance of documents, chapters, and passages. This calls for improvements at the micro level (e.g., better identification of relevant passages) and at the macro level (e.g., effective organization of vast networks into useful and usable logical collections). In addition, the old problem of preserving fragile paper materials is replaced by the problem of maintaining usability, given the potential transience of particular modes of storage (e.g., magnetic tape, diskette, CD-ROM, DVD). Libraries must plan for unending migration of their collections to new modes of storage.
Digital Library Collections
The materials in a collection may be found in a single group, or they may be contained in multiple, nonoverlapping subcollections, which may be housed at physically and logically distinct locations. This raises the problem, which is new to digital libraries, of combining documents found in several distinct collections into a set that will be most useful to the user of the library. This problem is commonly known as "collection fusion."
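One simple way to picture collection fusion is as the merging of ranked result lists whose scores are not directly comparable. The sketch below is only an illustration under stated assumptions: it rescales each collection's scores to a common range and then sorts the combined list. The collection names, document identifiers, scores, and the min-max rescaling are all hypothetical choices made for this example; practical systems may use quite different fusion strategies.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (document identifier, raw score from one collection)


def normalize(results: List[Result]) -> List[Result]:
    """Rescale one collection's scores to the range [0, 1] (min-max)."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal, so treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in results]


def fuse(collections: Dict[str, List[Result]], top_k: int = 5) -> List[Tuple[str, str, float]]:
    """Merge per-collection result lists into a single ranked list."""
    merged = []
    for name, results in collections.items():
        for doc_id, score in normalize(results):
            merged.append((name, doc_id, score))
    # Highest rescaled score first; ties are broken by collection name and id.
    merged.sort(key=lambda item: (-item[2], item[0], item[1]))
    return merged[:top_k]


# Hypothetical results: the two collections score on very different scales.
hypothetical_results = {
    "maps": [("map-17", 12.0), ("map-03", 7.0), ("map-44", 2.0)],
    "letters": [("letter-9", 0.875), ("letter-2", 0.625), ("letter-5", 0.125)],
}

fused = fuse(hypothetical_results, top_k=4)
for collection, doc_id, score in fused:
    print(f"{collection:8s} {doc_id:10s} {score:.2f}")
```

Rescaling before merging keeps a collection that happens to report large raw scores from crowding out the others, although it also assumes that the best document in each collection is equally useful, which will not always hold.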
In the world's libraries, most volumes are text, in some human language. These form a major part of digital libraries as well. Along with texts, libraries contain other texts about those texts. Data added to a collection, or to an object in the collection, for the purpose of identification is called "metadata" (i.e., "data about data"). An early example of metadata is the catalog card in a mid-twentieth-century library. In a digital library, some metadata are prepared by hand (i.e., "cataloging"). Metadata may also be generated automatically from machine-readable text that has been analyzed to support indexing and retrieval of passages and documents by subject.
Some digital library collections contain images or graphics that have been created by human effort (e.g., sketches, oil paintings, hand-designed maps). The object in the collection may, alternatively, be a mechanically produced (i.e., photographic) representation of a human product or artifact. Other images may be photographic or tomographic representations of naturally occurring scenes or objects. Each kind of image poses specific problems for the digital library. For a humanly produced artifact, information about the author and the date and place of production is both useful and (in principle) knowable, although there can be identification and attribution problems with older artifacts. Newly produced artifacts can be labeled or "tagged" with this information. Typically, the added information is permanently linked to the information object (e.g., by placing both in a single machine-readable file).
Metadata about images may also include more technical descriptions of the imaged object and of the imaging process itself, for example, "church in northern New Brunswick, imaged at 4:30 P.M. on April 13, 1999, in infrared light of wavelength 25 microns." Part of this information (that the object is an image of a church) must be provided by a human analyst. On the other hand, the wavelength of the infrared light, the time at which the image was made, and the precise geographic location of the imaged object (which may be determined by a global positioning system device built into the digital camera) can all be automatically included in the machine-readable record.
Is the World Wide Web a digital library? Strictly speaking, it is not; it is better regarded as an interconnected set of different libraries that contain different kinds of collections and serve different communities of users. In fact, a given person may belong to several communities of users (e.g., a dentist may, on the one hand, search the web for information about dental procedures and, on the other hand, for information about a gift for his or her son). Most digital libraries are best regarded as being made up of several related collections or subcollections. Users access these collections through two kinds of technologies: engines and interfaces. The engine is the collection of computer programs that locate documents for indexing and the programs that build indexes. The interface is the collection of computer programs that lets a user "see" the organization, the contents of the index, and often the documents themselves.
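The division between human-supplied and automatically captured image metadata can be sketched as a small machine-readable record. The field names and the JSON layout below are assumptions made for illustration, not a standard metadata schema; the values come from the example in the text, and the position is left empty because no coordinates are given.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass
class ImageMetadata:
    # Supplied by a human analyst who recognizes what the image shows.
    subject: str
    # Recorded automatically by the imaging device.
    captured_at: datetime
    wavelength_microns: float
    # (latitude, longitude) from a GPS device, when one is available.
    position: Optional[Tuple[float, float]] = None

    def to_json(self) -> str:
        """Serialize the record so it can be stored alongside the image."""
        record = asdict(self)
        record["captured_at"] = self.captured_at.isoformat()
        return json.dumps(record, sort_keys=True)


record = ImageMetadata(
    subject="church in northern New Brunswick",
    captured_at=datetime(1999, 4, 13, 16, 30),  # 4:30 P.M. on April 13, 1999
    wavelength_microns=25.0,
    position=None,  # left empty: the example gives no coordinates
)
print(record.to_json())
```

Keeping the record in a plain, self-describing form such as JSON is one way to place the metadata and the image reference together in a single machine-readable file.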
Digital Library Technologies
Indexing of texts by the words that they contain is an example of generating metadata from features. In this approach, the features of the document are simply the words that it contains. Indexing is based on things such as the frequency of words, which is taken to be an indication of what the document is about. In a sense, a "library catalog card" is created for every meaningful word and every book in which it occurs. Of course, the technology is more efficient, but the access provided has the same power that a huge card catalog would have.
Machine-readable records of sounds are difficult to index automatically. If the sound is a human voice speaking some natural language, voice recognition tools may be used to create an approximate transcription of the speech, which becomes a powerful basis for the creation of metadata for retrieval purposes. For naturally occurring sounds, such as birdcalls, the first line of classification is based on an analysis (using waves and wavelets) of the physical properties of the sound.
For texts themselves, the distribution of words in the text forms the basis for defining many kinds of features. These include the frequency with which a given term occurs in the document, the presence or absence of terms, the location of terms within the document, the occurrence of two- and three-word phrases, and other statistically defined properties of the text.
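The "catalog card for every word" idea corresponds to what is usually called an inverted index. The following is a minimal sketch, assuming a toy stop-word list, hypothetical letter texts, and a scoring rule that simply adds up word counts; real retrieval engines use more refined weighting.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List

# A tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "about", "and", "i", "in", "of", "the", "to"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep only meaningful (non-stop) words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_index(documents: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Map each word to the documents containing it, with its frequency there."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, text in documents.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][doc_id] = count
    return index


def search(index: Dict[str, Dict[str, int]], query: str) -> List[str]:
    """Rank documents by the summed frequency of the query words."""
    scores: Counter = Counter()
    for word in tokenize(query):
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]


# Hypothetical letters, invented for this example.
letters = {
    "letter-1": "Dear David, I am writing about Aunt Martha and her garden.",
    "letter-2": "Dear David, the invoice for the garden tools is attached.",
    "letter-3": "Aunt Martha sends her regards. Martha hopes to visit in April.",
}

index = build_index(letters)
print(search(index, "Aunt Martha"))
```

Here each entry of `index` plays the role of a stack of catalog cards for one word, so a letter can be found by what it says rather than by where it was filed.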
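The kinds of features listed above can be computed with a few lines of code. This is a rough sketch only: the sample sentence, the vocabulary, and the function names are hypothetical, and tokenization is reduced to splitting on runs of letters.

```python
import re
from collections import Counter
from typing import Dict, List


def words(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens as a phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text: str, vocabulary: List[str]) -> Dict[str, object]:
    tokens = words(text)
    frequency = Counter(tokens)
    positions: Dict[str, List[int]] = {}
    for offset, token in enumerate(tokens):
        positions.setdefault(token, []).append(offset)
    return {
        # How often each vocabulary term occurs in the document.
        "frequency": {t: frequency[t] for t in vocabulary},
        # Whether each vocabulary term occurs at all.
        "present": {t: t in frequency for t in vocabulary},
        # Where (by word offset) each vocabulary term occurs.
        "positions": {t: positions.get(t, []) for t in vocabulary},
        # Counts of two- and three-word phrases.
        "bigrams": Counter(ngrams(tokens, 2)),
        "trigrams": Counter(ngrams(tokens, 3)),
    }


sample = "Digital libraries index text. Digital libraries also index images."
features = extract_features(sample, ["digital", "images", "sound"])
print(features["frequency"])
print(features["positions"])
print(features["bigrams"].most_common(1))
```

Note that this simple tokenizer lets phrases run across sentence boundaries; whether that is acceptable depends on how the features will be used.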
As a shared national priority, the development of digital libraries is addressed both by existing libraries and by programs created by the national government and by philanthropic organizations. The National Science Foundation, the National Institutes of Health, and the U.S. Department of Defense, among others, joined temporarily in the early 1990s to solicit and fund projects to develop key technologies for digital libraries. This program was later subsumed into the broader Information Technology Research initiative, which addresses both key technologies and the building of collections.
In addition, organizations such as the private Getty Foundation and the Andrew W. Mellon Foundation encourage, with their funding, efforts aimed at building specific collections of images or texts in digital form. Such projects, for example the Columbia University Digital Library Initiative, often have a major research component. In general, the goal of these initiatives is to "manage the revolution" by ensuring that each major experiment in digital libraries is conducted and documented in a way that will ease the path for those who come later.
Management and Policy Issues
Digital libraries pose new problems related to management and policy. For example, rapid advances in technology make most forms of digital storage either obsolete or difficult to support after a period of about ten years. Therefore, digital libraries must provide for a continual process of "preservation" or "conservation" of the content of their collections by moving them from soon-to-be obsolete media to more contemporary media. Since the price of media is generally highest when they are new and falls sharply as they reach the end of their periods of market dominance, this poses difficult economic policy issues for digital libraries. This "migration" problem stands in sharp contrast to the ease with which one can still read a paper book printed more than two hundred years ago.
Policy issues also arise in the protection of authors, publishers, and readers from various kinds of exploitation. The methods (i.e., web-based browsers) that are used to deliver the contents of digital libraries require that a separate physical copy of the document reside on the computer (either in the volatile random access memory or on the more permanent hard drive storage). This copy might then be appropriated, adopted, modified, and used in other documents, thereby depriving the author and the publisher of revenue, credit, or both. For materials that are essentially images, there are methods, called "digital watermarking," for embedding a unique identifier in the image. While this does not prevent misappropriation of the material, it does support post facto discovery of that activity and the search for legal remedies. However, standard text processing tools can convert some kinds of page descriptions (for example, those made available in the Adobe PostScript format or the Portable Document Format, PDF) into pure text, which carries no watermark.
Protection of readers is also an issue because, even in an open political system, some materials are deemed inappropriate and potentially harmful to some groups of readers, particularly the young. There are both technical and policy problems related to identifying and tagging such materials automatically, as well as to knowing which readers should be permitted access to materials of each identified class. Efforts to resolve this problem legislatively are ongoing in the United States.
Costs and Benefits of Digital Libraries
While digital libraries represent an exciting new technology, they carry costs as well as benefits. As both technological and social entities, digital libraries compete for scarce resources, so they must be evaluated. The evaluation process must serve a diverse group of stakeholders, including individuals (e.g., readers, authors, librarians), corporate entities (e.g., libraries, host institutions, publishers), and national entities with shared interests (e.g., health, science, education).
The economics of digital libraries is still in its infancy, but it seems to be characterized by several key features: (1) observation of the library system in use, as opposed to a simple assessment of its size or collections, is essential to evaluation, (2) observations must be reduced both to numerical measures of some kind (e.g., statistics) and to comprehensible narrative explanations, and (3) the resulting measures and narratives must make sense to various groups of stakeholders in order to support decisions.
While much library material is ephemeral (e.g., newspapers), much of the value of libraries lies in the works of art and of scholarship that they contain. With regard to scholarship, colleges and universities are still developing the policies that will encourage (or discourage) the publication of digital works by their faculties. Unless such work is recognized, scholars will not produce in the new formats. There are challenging issues related to ownership of digital collections. The effort expended in digitizing collections must be paid for either by concurrent funding (e.g., government-sponsored programs) or with borrowed funds that must be repaid by sale of access to the materials. In a nation committed to equal access to information, many of whose leaders in science and industry obtained much of their basic knowledge in the public libraries of the twentieth century, issues of ownership and access will be among the pressing national policy problems of the twenty-first century.
Overall, it seems likely that libraries in digital form will become the norm. Systems for duplication and migration will be needed to ensure the permanence of the cultural heritage in this form. Sound economic frameworks will be needed to ensure that access is provided in ways that benefit the society as a whole rather than only those who can afford to pay for the newest technology. All in all, an ever-accelerating technology will make the digital library an increasingly effective servant and collaborator for the society that develops and maintains it.
See also: Cataloging and Knowledge Organization; Communications Decency Act of 1996; Databases, Electronic; Internet and the World Wide Web; Libraries, Functions and Types of; Libraries, National; Library Automation; Preservation and Conservation of Information; Retrieval of Information.
Columbia University Libraries. (2001). "Columbia University Digital Library Initiative." <http://www.columbia.edu/cu/lweb/projects/digital/>.
Faloutsos, Christos. (1996). Searching Multimedia Data Bases by Content. Boston: Kluwer Academic.
Getty Foundation. (2000). "Research at the Getty." <http://www.getty.edu/research/index.html>.
Kluwer Academic Publishers. (2001). "Information Retrieval." <http://www.wkap.nl/journals/ir>.
Lesk, Michael. (1997). Practical Digital Libraries: Books, Bytes, and Bucks. San Francisco: Morgan Kaufmann.
National Science Foundation. (2000). "Information Technology Research." <http://www.itr.nsf.gov/>.
National Science Foundation. (2001). "Digital Libraries Initiative." <http://www.dli2.nsf.gov/>.
Salton, Gerard, and McGill, Michael. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.
Springer Verlag. (2001). "International Journal on Digital Libraries." <http://link.springer.de/link/service/journals/00799/index.htm>.
Van Rijsbergen, C. J. (1979). Information Retrieval, 2nd edition. London: Butterworths.
Paul B. Kantor