Electronic databases are organized collections of data, or information, that are stored in computer-readable form. In general, electronic databases are of two types: those that can be accessed by large mainframe computers and those that can be accessed by small personal computers. However, this distinction is becoming less important as physically small computers continue to increase in power. Mainframe databases, most of which are highly specialized, are typically maintained by large businesses, institutions, and government agencies. Databases can be either publicly available or private. Private databases can be accessed only by employees of the organization that maintains them. Public databases are designed for access by the public. Databases for personal computers typically are created and used by individuals, small businesses, and units within large businesses; they can be used for a wide variety of purposes.
The term "database" is used in two senses. One refers to the organized collection of data that is created, maintained, and searched. The other refers to the software that is used to create and maintain the data. Database management systems are often simply called "databases." This entry concentrates on large, publicly available databases, together with the services that make them available.
The term "data" refers to facts, numbers, letters, and symbols that describe an object, idea, condition, situation, and so on. Data elements, which are the smallest units of information to which reference is made, are combined to create records. Data elements in a bibliographic reference include the names of the author or authors, the title of referenced work, the journal name, the pagination, the volume number, the issue number, and the date of publication. A data set is a collection of similar and related data records or data points that have not yet been organized for computer processing. A data file is an aggregation of data sets or records treated as a unit. While databases are also collections of related data and information, the difference between a data file and a database is that a database is organized (by a database management system) to permit the user to search and retrieve or process and reorganize the data.
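The relationship between data elements, records, a data file, and a searchable database can be sketched in a few lines of Python. This is only an illustration: the names `BibRecord`, `build_author_index`, and `find_by_author` are hypothetical, and a real database management system does far more than build a single index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BibRecord:
    """One bibliographic record; each field is a data element."""
    authors: tuple
    title: str
    journal: str
    volume: int
    pages: str
    year: int


# A data file: records aggregated and treated as a unit, not yet searchable.
data_file = [
    BibRecord(("Williams, Martha E.",), "Electronic Databases",
              "Science", 228, "445-456", 1985),
    BibRecord(("Williams, Martha E.",),
              "Implications of the Internet for the Information Industry "
              "and Database Providers",
              "Online & CD-ROM Review", 18, "149-156", 1994),
]


def build_author_index(records):
    """Organize the file for retrieval: map each author to their records."""
    index = {}
    for record in records:
        for author in record.authors:
            index.setdefault(author, []).append(record)
    return index


def find_by_author(index, author):
    """Return every record for the given author (empty list if none)."""
    return index.get(author, [])


author_index = build_author_index(data_file)
hits = find_by_author(author_index, "Williams, Martha E.")
print([record.year for record in hits])
```

The list `data_file` corresponds to a data file, while the index built over it stands in for the organization that a database management system adds to permit search and retrieval.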
The data in a database may be predominantly:
- word oriented (e.g., textual, bibliographic, directory, dictionary, full text),
- numeric (e.g., properties, statistics, time series, experimental values),
- image—both fixed images (e.g., photographs, drawings, graphics) and moving images (e.g., a film of microbes under magnification or time-lapse photography of a flower opening), or
- sound (e.g., a recording of the sound of a tornado, wave action, or an explosion).
The discussion in this entry is concerned primarily with digital data, although a large portion of raw data is recorded as analog data, which also can be digitized. Digital data are represented by discrete digits (e.g., the decimal digits zero to nine or, inside a computer, the binary digits zero and one). In the case of analog data, numbers are represented by physical quantities (e.g., the lengths read from a slide rule, measurements of voltage or current). These physical quantities can be converted to digital data through an analog-to-digital converter. Because word-oriented, numeric, image, and sound databases differ, they are processed by different types of software that are specific to each type of data. Digital data may be processed or stored on various types of media, including magnetic media (e.g., tapes, hard drives, diskettes, random access memory) and optical media (e.g., CD-ROMs, digital video discs). Users can access the data either through portable media or, more generally, through online sources.
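The conversion of an analog quantity to digital data can be illustrated with an idealized quantizer. The sketch below is not a model of any particular converter; the 0 to 5 volt input range and the 8-bit resolution are assumptions chosen for the example, and the function name `quantize` is hypothetical.

```python
def quantize(voltage, v_min=0.0, v_max=5.0, bits=8):
    """Map an analog voltage to an integer code, as an idealized ADC would."""
    levels = 2 ** bits
    # Clamp readings that fall outside the converter's input range.
    clamped = min(max(voltage, v_min), v_max)
    # Scale to the range 0..levels-1 and round to the nearest code.
    return round((clamped - v_min) / (v_max - v_min) * (levels - 1))


samples = [0.0, 1.25, 5.0, 6.0]
codes = [quantize(v) for v in samples]
print(codes)
```

Each continuous reading is replaced by one of 256 discrete codes, which is the form in which the value can then be stored and processed as digital data.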
The term "data" can refer to raw data, processed data, or verified data. Raw data consist of original observations (e.g., those collected by satellite and beamed back to Earth) or experimental results (e.g., laboratory test data). Raw data are subsequently processed or reduced to make them more useable, organized, or simplified. Large data sets need to be cleaned, processed, documented, and organized to enable their use. These activities are occasionally called "curation." In general, the more curated the data are, the more broadly useable they become to users outside the original research group or subdiscipline. Verified data are data whose quality and accuracy have been assured. For experimental data, this means that the original test or experiment has been duplicated and the same data have been produced. For observational data, it means that either the data have been compared with other data whose quality is known or the instrument through which the data were obtained has been properly calibrated and tested.
Many databases are used to retrieve and extract specific data points, facts, or textual information for use in building a derivative database or data product. A derivative database, which is the same as a value-added database or a transformative database, builds on a preexisting database and may include extractions from multiple databases as well as original data. When dealing with derivative databases, the question of intellectual property rights arises and must be resolved.
Availability of Online Public Databases
The range of public databases has grown to the extent that it is now possible to find data on almost any subject. Databases have been created for nearly every major field and many subfields in science, technology, medicine, business, law, social sciences, politics, arts, humanities, and religion as well as for news (worldwide, regional, or subject-related), problems (specific to topics and organizations), missions (such as transportation, defense, shipping, robotics, oil spills, solid waste), and consumer interests such as shopping and automobile repair.
The first comprehensive database directory, Computer-Readable Databases (CRD), was compiled and edited by Martha E. Williams and was published in 1976 by the American Society for Information Science. CRD originally covered 301 publicly available databases, but by the mid-1980s, the number of publicly and commercially available electronic databases that were listed in CRD had grown to more than three thousand. Gale Research, Inc. (which became the Gale Group in 1999) acquired CRD in 1987 and continued to publish it until 1992, when it was renamed the Gale Directory of Databases (GDD). By the year 2000, GDD had grown to include more than twelve thousand databases. Both CRD and GDD included all types of public, commercial databases (i.e., word-oriented, number-oriented, picture-oriented, and sound-oriented databases), as well as multimedia databases, which combine these types.
When a database is developed for public use, it is usually made accessible to users through a telephone connection to the host computer ("online") where it resides; wireless access, however, is gaining importance as a technology for access. Database services may be provided by the producer of the database or, more commonly, by a separate organization that offers online searching of one or more databases.
In order to find information online, one needs to know which database is likely to contain that information. There are several ways of identifying specific databases. One way is through the use of printed directories such as GDD. Another way is through online directories that are maintained by search services for those databases on their systems. Yet another way is through the various search engines on the World Wide Web. Search engines may use various methods to index or catalog website contents. Web crawlers robotically go from website to website and index their contents. Examples of web crawlers include AltaVista, Excite, HotBot, Magellan, and WebCrawler. Some search engines, such as Yahoo, Lycos, and LookSmart, use directories that are generated by humans who intellectually catalog websites. Metacrawlers check many search engines to produce a single list of databases so that the user does not have to check each search engine individually.
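The way a metacrawler produces a single list from several search engines can be sketched as a merge of ranked result lists. The merging rule used here (best rank first, then the number of engines that returned the item) is an assumption made for illustration, not a description of how any actual metacrawler worked; the function name and the example addresses are invented placeholders.

```python
def merge_results(result_lists):
    """Combine ranked result lists into one list without duplicates.

    Each item is ordered by its best (lowest) rank in any list; ties are
    broken by how many lists returned it, then alphabetically.
    """
    best_rank = {}
    hit_count = {}
    for results in result_lists:
        # dict.fromkeys drops repeats within one list but keeps the order.
        for rank, url in enumerate(dict.fromkeys(results)):
            best_rank[url] = min(rank, best_rank.get(url, rank))
            hit_count[url] = hit_count.get(url, 0) + 1
    return sorted(best_rank, key=lambda u: (best_rank[u], -hit_count[u], u))


# Invented results from two hypothetical search engines.
engine_a = ["http://example.org/chem", "http://example.org/law",
            "http://example.org/news"]
engine_b = ["http://example.org/law", "http://example.org/med"]

merged = merge_results([engine_a, engine_b])
print(merged)
```

The user sees one combined list in which an item returned by both engines appears only once.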
Organizations that provide online search services are also called online vendors. They have the computers and software (computer programs) that allow outside users to search databases themselves for data and information, whether it is in the form of numeric data, text, images, sounds, or a mixture of these formats.
Users and Access
Users of public databases include most groups of people whose profession, business, and educational activities require quick access to information. This includes scientists, lawyers, doctors, stockbrokers, financial analysts, librarians, executives, students, and other researchers. Some public databases and search services are focused on consumer needs, providing access to such information as flight schedules, merchandise catalogs, movie reviews, theater schedules, restaurant information, and hotel/motel availability and reservation services. In addition, there are financial, bibliographic, and other services that were initially developed for professional and business users.
Online access to a database usually requires that the user has computer access to an account with a search service that offers such access, a password to log onto the service, knowledge of how to use the service, and information about specific features of the database.
Procedures that users need to know in order to take full advantage of search services and the databases to which they provide access vary widely in complexity. This complexity depends on the type of stored information and the user that the database was designed to serve. For example, searching a database for physical or chemical properties of a certain class of substance requires a different and less widely held kind of knowledge than does searching a database for the names of theaters in a given geographic area. Similarly, an online system intended for professional researchers who use it daily can afford to be more complex, and to offer more powerful features, than one aimed at occasional end users. Some database producers and/or vendors offer their services to users over the Internet, providing access to all or a sampling of their database product, either for free or for a fee.
Types of Databases
Databases are organized and maintained in different ways for different types of information (i.e., words, numbers, sounds, and images). Each information type has a distinctive machine representation and requires a distinct kind of software. Word-oriented databases contain words, phrases, sentences, paragraphs, or text as their principal data. The principal data in numeric databases, often called "databanks," consist of numbers and symbols that represent numbers, statistics, experimental values, time series (i.e., events or phenomena observed over a span of time), tables of numbers, graphs that are based on such tables, and similar material. Pictorial databases, many of which are constructed for scientific or engineering purposes, may contain representations of virtually any multidimensional structure (e.g., chemical structures, nuclear particles, graphs, figures, photographs, architectural plans, and geographic maps). Moving picture databases can represent virtually anything shown in motion. Audio databases, which contain sounds, can represent music, voices, sounds of nature, and anything else that can be heard.
Alphabetic and alphanumeric strings of characters cannot be handled by numeric processing software. In other words, these strings cannot be added, subtracted, multiplied, or divided. Therefore, they require software that is designed specifically for handling character strings. Word-oriented databases allow the users to search the database for strings of characters that match the strings of characters in, for example, names, titles, and keywords. Most of these databases allow the user to search using partial words (i.e., truncated words that use a wildcard symbol such as an asterisk to permit multiple endings on the word stem). For example, a user who conducts a single search using the string or partial word "bridg*" would be able to retrieve information related to "bridge," "bridges," "bridged," "bridging," and other similar words or phrases. Word-oriented databases were the earliest publicly available electronic databases. They were introduced in the 1960s and contained predominantly information related to science, engineering, technology, and medicine. These early databases contained bibliographic references to published scientific and technical literature, and there were initially only a few dozen of them. They have since multiplied into the thousands.
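Truncation searching of the "bridg*" kind can be sketched with Python's standard `re` module. This is one possible way to implement the idea, not the method used by any particular search service; the function names and the sample titles are invented for the example, and the asterisk is assumed to stand for any run of letters.

```python
import re


def truncation_to_regex(term):
    """Translate a search term with '*' wildcards into a compiled regex."""
    # Escape every piece literally; each '*' becomes "any run of letters".
    parts = [re.escape(piece) for piece in term.split("*")]
    pattern = r"\b" + r"[a-z]*".join(parts) + r"\b"
    return re.compile(pattern, re.IGNORECASE)


def search(records, term):
    """Return the records whose text contains a word matching the term."""
    matcher = truncation_to_regex(term)
    return [text for text in records if matcher.search(text)]


# Invented sample titles.
titles = [
    "Bridging the gap in polymer chemistry",
    "Suspension bridges of the nineteenth century",
    "Abridged history of rail transport",
    "Tunnel ventilation",
]
print(search(titles, "bridg*"))
```

Because the pattern is anchored at a word boundary, the stem matches "Bridging" and "bridges" but not "Abridged," where the same letters occur in the middle of a word.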
Bibliographic databases range in size from small files such as the Acid Rain database (with approximately four thousand citations) and the AgeLine database (with approximately fifty thousand citations) to large files such as Medline (with more than eleven million citations in the biomedical and health sciences fields). Chemical Abstracts Service produces several databases, which in the year 2000 collectively included more than twenty-two million citations to documents.
Full-text databases provide access to the texts of such documents as legal cases and statutes, wire services, journal articles, encyclopedias, and textbooks. Except for the Lexis-Nexis service, which has a large set of legal databases that are mostly grouped in "libraries" of databases, most of the full-text databases were established after 1980. The first full-text database, Lexis, was established in 1973 by Mead Data Central (which later became the Lexis-Nexis service). Lexis-Nexis is one of the world's largest word-oriented database services, and among services that have legal databases, it is approached in size only by Westlaw, a legal database service established in 1975 by West Publishing Company (which later became the West Group). The Westlaw service includes billions of pages of information in thirteen thousand databases in a few dozen "libraries" (all represented as a few dozen entries and umbrella entries in GDD).
Online newspapers, newsletters, journals, and textbooks are among the numerous full-text databases that are available online. Examples include the United Press International and Associated Press wire services, The New York Times and Wall Street Journal newspapers, and U.S. News and World Report and Newsweek magazines. Examples of electronic journals are the Harvard Business Review and many of the American Chemical Society journals. Electronic encyclopedias include the Academic American Encyclopedia and Encyclopaedia Britannica. Among the many thousands of medical textbook databases are Gray's Anatomy, Textbook of Surgery, and Principles and Practices of Emergency Medicine.
In numeric databases, numbers and symbols are the principal data that are stored and processed. Generally, compared to word-oriented databases, numeric databases involve less fetching and character-string matching and more processing. Most of the programming for a numerical database involves manipulating the data mathematically and presenting it in reports that are formatted and labeled in forms that are familiar to the specific class of users for which the database is designed. Statistical routines, time series, and other programs for manipulating data mathematically work in the same way for numeric data regardless of whether the data relate to sociology, economics, finance, chemistry, or any other field. One example of a large time series database is the National Online Manpower Information Systems (NOMIS), which is produced by the University of Durham in England and has more than twenty billion time series records in its databases.
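The point that a mathematical routine works the same way whatever the data describe can be illustrated with a simple moving average over a time series. The routine below is a minimal sketch; the function name is hypothetical and the figures are invented, so they could equally stand for monthly employment counts or laboratory readings.

```python
from statistics import mean


def moving_average(values, window):
    """Return the average of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [mean(values[i:i + window])
            for i in range(len(values) - window + 1)]


# Invented observations taken at equal intervals.
series = [2, 4, 6, 8]
smoothed = moving_average(series, window=2)
print(smoothed)
```

Only the labels and report formats presented to the user depend on the field; the computation itself does not.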
Pictorial databases are relatively specialized and are fewer in number. Their data consist chiefly of specifications for shapes, distances, geometrical relationships (including three-dimensional relationships), colors, and the like. The computer processing of pictorial data (including photographs and videos) requires sophisticated programs for such functions as video pattern matching, coordinate matching, and extraction of specific features of photographs, maps, videos, or other pictorial representations. Computer processing of sounds has its own set of requirements for matching and analyzing sounds (e.g., by parsing and other techniques).
Production and Distribution
Databases are produced by a wide variety of commercial, governmental, academic, and nonprofit organizations. The way in which a database is created depends on whether it is a primary database (e.g., containing the text of an original article) or a secondary database (e.g., providing references, abstracts, or index entries associated with an original article). To prepare a secondary database, the producers cull the primary literature for source material (books, journals, dissertations, government reports, and conference proceedings) in order to identify items that are relevant to the subject area of the database. For each item selected for mention in the secondary database, the producers prepare a bibliographic record that lists the names of the author or authors, the title of the article or book, and further information that is needed in order to find the cited publication. The record is then entered into the database, and individual data elements (e.g., author, title, date of publication, journal name, volume number, issue number, page range, and so on) are identified by a specific code or position in the record. In some bibliographic databases, the records include index terms and/or keywords for the articles and books that are referenced. Other bibliographic databases also include abstracts of the articles.
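Identifying data elements by a code can be sketched as follows. The two-letter tags used here (AU, TI, JN, VO, PG, PY) are hypothetical and chosen only for illustration; each database producer defines its own codes or field positions. The sample record is built from a reference listed at the end of this entry.

```python
def parse_tagged_record(text):
    """Parse 'TAG: value' lines into a dict; repeated tags collect in a list."""
    record = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # ignore blank lines inside the record
        # Split at the first colon only, so values may contain colons.
        tag, _, value = line.partition(":")
        record.setdefault(tag.strip(), []).append(value.strip())
    return record


raw = """
AU: Williams, Martha E.
TI: Electronic Databases
JN: Science
VO: 228
PG: 445-456
PY: 1985
"""

record = parse_tagged_record(raw)
print(record["AU"], record["PY"])
```

Storing each value under its tag is what later allows a search to be restricted to a single element, such as author or year of publication.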
Most large databases are updated periodically (e.g., monthly, weekly, continuously). These updates may be put on magnetic tape and shipped, they may be transmitted directly to search service organizations for incorporation, or they may be made available for downloading from the producer's website. Some small databases are issued on floppy disks or CD-ROMs for use on personal computers. Other small databases are sold as a part of handheld devices that contain both information and searching capabilities. Some large databases are sold or leased to government agencies and corporations for in-house use. Other large databases are sold or licensed to online search services where they are reformatted by the search service's software or search engine in order to allow searching by their customers.
Electronic databases are accessed mainly through online search services (i.e., database vendors) and/or directly through the Internet. These services provide online databases together with software for search and retrieval, data manipulation, and modeling. They are sometimes called "information utilities," because, like electric or gas utilities, an online search service serves a widely distributed network of users. Several hundred such services in the United States and Europe provide access to more than twelve thousand databases and databanks worldwide with billions of records.
If a database is part of a commercial online service, anyone with a microcomputer, a modem, and a telephone can have access to it for a fee. The search fee includes charges for accessing the database itself and for the use of its search software. There may also be charges for printing or downloading search results.
The fees required for using search services vary widely from service to service and from database to database. Many services charge only for the actual use of the service. Others require subscription fees, monthly or yearly minimum payments, and the like. Information that is available at websites on the Internet may be entirely free, or it may require a payment.
Charges usually are based on usage or on units accessed, retrieved, or delivered. Usage is measured in terms of connect time (i.e., the number of minutes that are used to carry out an online search), or in terms of the number of records accessed, viewed, retrieved, downloaded, or printed, or in terms of computer resource units. Resource units measure the amount of the computer facility (including machine time and storage capacity) that is used in a search. The units accessed, retrieved, or delivered may be, for example, bibliographic references in a bibliographic database, individuals identified in an employment database, or time series in a time series database. The units may be displayed on the user's terminal, printed out by the search service and sent to the user, or, more commonly, downloaded by the user for local printing and use.
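A usage-based charge of this kind reduces to simple arithmetic. The sketch below combines a connect-time charge with a per-record charge; the rates are invented for the example and do not reflect the prices of any actual service, and the function name is hypothetical.

```python
from decimal import Decimal


def search_charge(connect_minutes, records_delivered,
                  rate_per_minute, rate_per_record):
    """Total charge = connect-time charge + per-record delivery charge."""
    connect_cost = Decimal(connect_minutes) * rate_per_minute
    record_cost = Decimal(records_delivered) * rate_per_record
    return connect_cost + record_cost


# Invented rates: 1.50 per connect minute and 0.25 per record delivered.
total = search_charge(
    connect_minutes=12,
    records_delivered=40,
    rate_per_minute=Decimal("1.50"),
    rate_per_record=Decimal("0.25"),
)
print(total)
```

With these assumed rates, a 12-minute search that delivers 40 records costs 18.00 for connect time plus 10.00 for records, or 28.00 in total. `Decimal` is used rather than floating point so that currency amounts are not subject to binary rounding error.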
Among the commercial online services for searching numeric databases are Standard & Poor's DRI (Data Products Division), GE Information Services, The WEFA Group, and the Oxford Molecular Group (Chemical Information System). All of these except Chemical Information System provide mainly business-oriented databases; Chemical Information System provides mainly scientific databases. Among the vendors of word-oriented databases are Lexis-Nexis, The Dialog Corporation (DIALOG Information Services, Inc.), the U.S. National Library of Medicine, West Group (Westlaw), CompuServe Information Service, America Online, Inc., and Dow Jones and Company, Inc.
DIALOG, the largest of the online search services that provide mostly bibliographic databases, began offering commercial search services in 1972. At that time, it featured two government-produced databases—ERIC (Educational Resources Information Center) and NTIS (National Technical Information Service). By the year 2000, DIALOG had several hundred databases with nine terabytes of data. The U.S. National Library of Medicine began its search service in 1971, offering the Medline database with 147,000 records. By the year 2000, the U.S. National Library of Medicine had dozens of databases with about eleven million records. Lexis-Nexis, the largest service to provide mostly textual databases, introduced its commercial online service in 1973 with a database of 208,000 documents or 2.5 billion characters. By the year 2000, it had burgeoned to more than 2.8 billion searchable documents or 2.6 trillion characters of data.
See also: Bibliography; Cataloging and Knowledge Organization; Computer Software; Computing; Database Design; Information Industry; Internet and the World Wide Web; Knowledge Management; Libraries, Digital; Library Automation; Management Information Systems; Systems Designers.
Faerber, Marc, and Nagel, Erin, eds. (2000). Gale Directory of Databases, 2 vols. Detroit, MI: Gale Group.
Williams, Martha E. (1985). "Electronic Databases." Science 228:445-456.
Williams, Martha E. (1994). "Implications of the Internet for the Information Industry and Database Providers." Online & CD-ROM Review 18(3): 149-156.
Martha E. Williams