Cataloging and Knowledge Organization

If information cannot be found when it is wanted, it cannot be integrated into the world of human knowledge or into an individual's personal knowledge base. Whether people want to write a newspaper article, complete a project, or learn about a new hobby, they need to be able to find information that relates to what they are doing and to what they want to know. The overall purpose of cataloging and knowledge organization is to help people achieve the goal of finding information as easily as possible when they need it. This goal may seem to be a simple one, but accomplishing it is not necessarily easy or straightforward. For example, the way in which information is described and organized should ideally be consistent within one information medium, compatible with other information media, and predictable and appropriate for different kinds of information in different media. In addition, description and organization of information should be flexible enough to accommodate all the different assumptions, views of the world, and natural languages that human beings currently employ, as well as those that they have employed throughout the history of recorded information.

People have recorded information in many ways and in many forms. One general term for all of these information containers is "item." Information items can be textual (e.g., books or magazines), nontextual (e.g., paintings or sculptures), or a combination of the two (e.g., musical scores or maps). In addition, these information items can be physically stored in institutions such as libraries, museums, or archives, and/or they can be virtually stored in databases (textual or non-textual) for private use (e.g., within an organization) or for public use (e.g., through the Internet). This variety of possibilities has motivated information professionals to develop standardized and nonstandardized ways of helping people find what they want.

Recording Data about Information Items

Three processes help information professionals create access for users. These are the description of an item, the choice of descriptive elements as access points (i.e., data that may be searched), and the entry of the description into a file that is either manually or electronically searchable.

The description of an information item is a surrogate representation for it. A surrogate record stands for the information item in a manual or an electronic file. The purpose of the description is to allow people to decide whether they want to look at the thing itself. For example, the surrogate description for a book includes both physical characteristics (e.g., number of pages and dimensions) and intellectual characteristics (e.g., title and subject). These and other data elements (e.g., author, publisher, date of publication) help people decide whether they want to read the book. In general, the process of creating a description and assigning access points is known as "cataloging." The process of creating "metadata" has roughly the same meaning, but it may include how the description is put into machine-readable form, where the item may be found, and/or its relationship with other items. "Resource description" is a similarly broad term for methods of creating surrogates for any kind of item. The surrogate might be a cataloging record, an abstract or summary, or a thumbnail picture of the item. Clearly, if descriptions of items are standardized and predictable, people will more easily find the information they are looking for because they can make a comprehensive search of an information file.
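
The three processes described above (description, choice of access points, and entry into a searchable file) can be pictured with a minimal Python sketch. The record layout, the choice of searchable fields, and the sample book are all hypothetical and are not drawn from any cataloging code.

```python
from dataclasses import dataclass


@dataclass
class SurrogateRecord:
    """Stands in for an information item in a searchable file."""
    title: str        # intellectual characteristic
    author: str
    publisher: str
    year: int
    pages: int        # physical characteristic
    subjects: tuple   # intellectual characteristic


# Hypothetical choice of the data elements that may be searched.
ACCESS_POINTS = ("title", "author", "subjects")


def build_index(records):
    """Map each access-point value (lowercased) to the records that carry it."""
    index = {}
    for record in records:
        for field_name in ACCESS_POINTS:
            value = getattr(record, field_name)
            values = value if isinstance(value, tuple) else (value,)
            for v in values:
                index.setdefault(v.lower(), []).append(record)
    return index


# An invented book, used only to exercise the sketch.
book = SurrogateRecord(
    title="A Handbook of Gardening",
    author="Doe, Jane",
    publisher="Example Press",
    year=2000,
    pages=250,
    subjects=("Gardening",),
)
catalog_index = build_index([book])
```

In this sketch the publisher is part of the description but is not an access point, so looking up "example press" in `catalog_index` finds nothing, whereas "doe, jane" and "gardening" both lead to the record.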

Charles Cutter (1904) identified three purposes for cataloging: (1) to allow someone to find an item with a known creator, title, or subject, (2) to allow someone to discover what an institution has about a certain topic, and (3) to allow someone to select an appropriate item from among a number of similar ones. These three goals still guide any catalog or other finding aid. The first objective is met by cataloging rules and codes, the second is met by knowledge organization systems, and the third is met by both kinds of systems.

Cataloging Rules and Codes

Standardized rules for cataloging have a long history. The most widely used cataloging code in the English-speaking world is the second revised edition of the Anglo-American Cataloguing Rules (AACR2, 1998). These rules were developed by an international Joint Steering Committee that included members from the United States, Canada, Australia, and Great Britain. AACR2 descriptions conform to a more general standard, the International Standard Bibliographic Description (ISBD), which was produced by a 1969 meeting of cataloging experts in Copenhagen, Denmark. The ISBD had three general goals. Its creators wanted to ensure (1) that records produced in one country in one language could be understood in other countries that might use other languages, (2) that records produced in one country could be integrated into files produced elsewhere, and (3) that records could be converted into machine-readable form. The first ISBD standard was developed for monographs, and since then standards have been developed for printed music, nonbook materials, maps, computer files, and antiquarian materials, among others. Similarly, AACR2 contains rules for describing items in these different formats.

AACR2 is divided into two parts. Rules in the first part prescribe how to record data about the item using eight different areas or fields. These eight areas are either preceded or surrounded by punctuation marks that differentiate among the various roles a person or an institution played in the creation of the item. For example, in the area for the title and statement of responsibility for the item, the statement from the item that names its creator(s) is preceded by " / " (i.e., space, slash, space). Different punctuation marks are used for different information elements in the cataloging record. This kind of standardized punctuation allows people (or computers) to understand the information in the record without necessarily being able to read the language in which the record is written.
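
The role of the prescribed punctuation can be illustrated with a small Python sketch. It handles only the space-slash-space mark mentioned above and only the title and statement of responsibility; the function names are assumptions made for this illustration, and a full description uses many more marks and areas.

```python
SEPARATOR = " / "  # space, slash, space


def title_area(title, responsibility):
    """Join a title and a statement of responsibility with ' / '."""
    return f"{title}{SEPARATOR}{responsibility}"


def split_title_area(area):
    """Split at the first ' / ' without interpreting the words themselves."""
    title, _, responsibility = area.partition(SEPARATOR)
    return title, responsibility


area = title_area("The Adventures of Huckleberry Finn", "Mark Twain")
parts = split_title_area(area)
```

Because `split_title_area` relies only on the punctuation, it would separate the two elements equally well if the title and name were written in a language the reader or program did not understand.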

The second part of AACR2 contains rules for choosing access points and for standardizing the information content in a surrogate description. For example, a creator often uses a different name for different works. People may use different forms of the same name, or they may use entirely different names. Sometimes, people change their names or write under pseudonyms. First, a cataloger needs to decide which of these names to use in the surrogate record. Next, the cataloger needs to decide which form of that name to use in the surrogate. The purpose of these decisions is to establish a standardized name in a standardized form based on which name is likely to be known to the most people. For example, The Adventures of Huckleberry Finn was written under the pseudonym Mark Twain, but the author's real name was Samuel Clemens. Because Samuel Clemens never wrote under his real name, the cataloger will choose Mark Twain as the name most people are likely to know and use for a search.

In addition, different people can have the same name, and the cataloger needs to distinguish among people who have the same name in order to separate items created by one person from those created by another. For example, works by the nineteenth-century American novelist Winston Churchill need to be distinguished from works by the twentieth-century British prime minister with the same name. One way to make this distinction is to add a person's birth and/or death dates to the name. Another way is to add extra elements (e.g., a middle name). For names in English, the surname is usually listed first followed by the forename(s). This practice is familiar through, for example, telephone books. The process of choosing names and other access points and of establishing standardized forms for them is called "authority work." Authority work includes making references from unused names and from unused forms of a name to the standardized name and form so that, for example, people who look up one name (e.g., Clemens, Samuel) are directed to the chosen name (e.g., Twain, Mark). In this way, authority work ensures that people who are interested in works by the novelist Winston Churchill do not retrieve works by the prime minister Winston Churchill.
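
A rough Python sketch of an authority file may make these ideas concrete. The data structure and function names are hypothetical; the dated headings follow the practice described above of adding birth and death dates to distinguish people with the same name.

```python
# Each authorized heading lists the unused names that refer to it.
AUTHORITY_FILE = {
    "Twain, Mark": ["Clemens, Samuel"],
    "Churchill, Winston, 1871-1947": [],   # the novelist
    "Churchill, Winston, 1874-1965": [],   # the prime minister
}


def build_see_references(authority_file):
    """Invert the file so that each unused name points to its heading."""
    references = {}
    for heading, unused_names in authority_file.items():
        for name in unused_names:
            references[name] = heading
    return references


SEE_REFERENCES = build_see_references(AUTHORITY_FILE)


def resolve(name):
    """Return the authorized heading for a name, or None if it is unknown."""
    if name in AUTHORITY_FILE:
        return name
    return SEE_REFERENCES.get(name)


def candidates(name):
    """List every authorized heading that begins with an undifferentiated name."""
    return sorted(
        heading for heading in AUTHORITY_FILE
        if heading == name or heading.startswith(name + ",")
    )
```

Here `resolve("Clemens, Samuel")` leads to "Twain, Mark", and `candidates("Churchill, Winston")` returns both dated headings so that a searcher can choose the person who is wanted.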

Various other sets of standards and rules that have been developed for generating surrogates in both manual and electronic information environments include Archives, Personal Papers, and Manuscripts, the Dublin Core Elements, and Encoded Archival Description. Each of these systems is appropriate for a particular kind of information item, and each has its own set of useful data elements for describing an item and establishing appropriate access points for it. Tools (called "crosswalks") for comparing the various standards have been developed to help information professionals understand differences and similarities among different standards, such as the different definitions each standard may give to the title or the creator of a work. Crosswalks establish which field(s) from one standard map onto which field(s) in another. Similar to translating between different natural languages, translating between different standards is not automatic, but it is an important activity because it allows one to merge files that contain records using different standards. In this way, records can be shared, and more people have access to the record for an item they are interested in retrieving.
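
In its simplest form, a crosswalk can be thought of as a table of field correspondences. The following Python sketch assumes two invented standards with invented field names; it does not reproduce any published crosswalk.

```python
# Hypothetical mapping from fields of "standard A" to fields of "standard B".
CROSSWALK = {
    "title": "main_title",
    "creator": "author",
    "date": "issued",
}


def convert(record, crosswalk):
    """Translate a record field by field; collect fields with no equivalent."""
    converted, unmapped = {}, {}
    for field_name, value in record.items():
        target = crosswalk.get(field_name)
        if target is None:
            unmapped[field_name] = value   # needs a human decision
        else:
            converted[target] = value
    return converted, unmapped


record_a = {
    "title": "A Handbook of Gardening",   # invented sample data
    "creator": "Doe, Jane",
    "rights": "Example rights statement",
}
record_b, leftovers = convert(record_a, CROSSWALK)
```

The `leftovers` dictionary reflects the point made above that translation between standards is not automatic: a field with no counterpart in the target standard cannot simply be copied across.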

Knowledge Organization Systems

One of the objectives that Cutter (1904) had for a surrogate system was to allow people to find items that have the same topic or subject. The topic of an item is what it is about (e.g., landscape painting, theoretical astrophysics, gardening, or how to fly an airplane). The term "knowledge organization" encompasses different methods for organizing information, but it is sometimes used more narrowly for the organization of information by topic or subject. Standardized (i.e., alphabetical systems, classification systems) and nonstandardized methods of specifying subjects have been developed, all of which can be used in both manual and electronic environments to help people retrieve the information they want.

Standardized Methods

A cataloger analyzes an information item to determine its topic and the concepts it uses and then translates the concepts in the analysis into a standardized or controlled vocabulary. Standardized methods of knowledge organization include systems that are primarily displayed alphabetically (e.g., subject-heading systems and thesauri) and systems that are primarily displayed systematically (e.g., classification and ontological systems). These two types of systems are not mutually exclusive because alphabetical systems include classificatory elements, and classificatory systems include alphabetical elements. Both kinds of system are used to organize resources on the Internet (e.g., Beyond Bookmarks) and in nonelectronic information environments.

Subject-heading and thesaural systems are called "controlled vocabularies" because the particular terms the system prefers for expressing each concept are chosen in advance and controlled by the system developers. Searchers are guided to these preferred terms by networks of references that are called the "syndetic structure" of the system. Assigning subject headings to information items is usually called "subject cataloging" and assigning thesaurus terms is usually called "indexing."

Subject-heading lists provide words and/or phrases that may be used as access points for subjects. Subject-heading lists are often used in libraries and are usually created for knowledge in general. These systems provide networks of terms to describe the subjects in a document. Library of Congress Subject Headings, first published in 1914, is used in many large academic and national libraries in English-speaking countries. Usually, a cataloger gives a book more than one subject heading, and in an online system subject headings can be searched by keywords. That is, the searcher does not have to know the exact form of the subject heading in order to use it for searching.

Thesauri began to be developed in the 1950s. Thesaural systems are similar to subject-heading systems in providing lists of consistent terms that are assigned to an information item by an indexer. Unlike subject-heading systems, however, thesauri are usually created for a particular field. For example, the Art & Architecture Thesaurus, published by the Getty Information Institute, provides access to all kinds of heritage information items (e.g., texts, images, museum materials). In addition, the syndetic structures of thesauri are usually more strictly controlled than those of subject-heading systems, and the terms in them are defined for the particular purposes of that field of knowledge.

Both subject-heading and thesaural systems include codes that describe the relationships of one term to other terms. The most common relationships are "broader term" (BT), "narrower term" (NT), and "related term" (RT). A broader term names a concept that is wider in scope than another. For example, the concept "precipitation" is broader than "snow." A narrower term names a concept that is more specific. For example, the concept "oak tree" is narrower than "tree." A related term is associated in some way with the term in question but is neither broader nor narrower in scope. For example, "light" is related to "color" and may interest a searcher who has looked up "color," but the two terms do not have a broader/narrower hierarchical relationship. In addition, some terms are preferred terms (called "used terms"). Terms not preferred by the system are called "unused terms." Unused terms are considered synonyms for used terms and cannot be used for searching. For example, "wig" may be a synonym for "hair." People who look up an unused term (e.g., "wig") are directed to search with a used term instead (e.g., "hair").
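
These relationships can be sketched as a small data structure in Python. The terms are the examples given above; the dictionary layout and function names are assumptions made for illustration and do not represent any published thesaurus.

```python
# Used terms with their coded relationships (BT, NT, RT).
THESAURUS = {
    "precipitation": {"NT": ["snow"]},
    "snow": {"BT": ["precipitation"]},
    "tree": {"NT": ["oak tree"]},
    "oak tree": {"BT": ["tree"]},
    "color": {"RT": ["light"]},
    "light": {"RT": ["color"]},
    "hair": {},
}

# Unused terms and the used terms that searchers are directed to.
USE = {"wig": "hair"}

# Each relationship implies its inverse from the other term's point of view.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}


def preferred_term(term):
    """Direct an unused term to its used term; used terms pass through."""
    return USE.get(term, term)


def related(term, relationship):
    """List the terms that stand in the given relationship to a term."""
    entry = THESAURUS.get(preferred_term(term), {})
    return entry.get(relationship, [])


def is_reciprocal(thesaurus):
    """Check that every coded relationship is matched by its inverse."""
    for term, entry in thesaurus.items():
        for relationship, targets in entry.items():
            for target in targets:
                inverse = INVERSE[relationship]
                if term not in thesaurus.get(target, {}).get(inverse, []):
                    return False
    return True
```

A searcher who enters "wig" is led by `preferred_term` to "hair," and `related("snow", "BT")` returns "precipitation." The `is_reciprocal` check is one simple way to test the consistency of the network of references.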

Controlled vocabularies are useful in information retrieval systems because the terms assigned to information items can be used to search a database. Searching with an assigned term ensures that all the records that have been indexed with that term are retrieved. Certainty that all the relevant records have been found means that a searcher can feel confident that the search was comprehensive. Otherwise, the searcher would have to think of all the possible synonyms of a term in order to be sure that the search was complete.

Classification systems are structured systems that divide some knowledge domain into groups on the basis of likenesses and/or differences among the members of each group. The study of classification dates back at least to the philosophers of ancient Greece. Modern bibliographic classification systems started to appear in the late nineteenth century. In an ideal classification system, the classes are both mutually exclusive and jointly exhaustive. That is, the classes do not overlap (i.e., mutually exclusive), and all the classes taken together encompass all possible content so that nothing is left out (i.e., jointly exhaustive). This ideal cannot be fully achieved because new members of the classes can be discovered or invented at any time. Nevertheless, the ideal can be used to help evaluate classification systems because one can assess the classes for mutual exclusivity and joint exhaustivity.
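
Both tests can be stated precisely when the classes and the domain are treated as sets. The following Python sketch uses an invented miniature domain; the class names and members are hypothetical.

```python
from itertools import combinations


def is_mutually_exclusive(classes):
    """True if no member belongs to more than one class."""
    return all(a.isdisjoint(b) for a, b in combinations(classes.values(), 2))


def is_jointly_exhaustive(classes, domain):
    """True if every member of the domain falls into at least one class."""
    covered = set().union(*classes.values())
    return domain <= covered


# An invented classification of an invented domain.
classes = {
    "trees": {"oak", "pine"},
    "flowers": {"rose", "tulip"},
}
domain = {"oak", "pine", "rose", "tulip"}

# A newly discovered member that fits no existing class.
enlarged_domain = domain | {"fern"}
```

The classes pass both tests for `domain`, but they are no longer jointly exhaustive for `enlarged_domain`, which mirrors the observation above that new members can be discovered or invented at any time.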

In North America, most libraries use either the Dewey Decimal Classification or the Library of Congress Classification (in which each class is published separately). Both of these classification systems are called "enumerative systems" because they seek to list all of the possible topics that documents may have. In libraries, classification systems are used both to show the place of a particular topic in the context of the world of knowledge and also to provide a shelf address for each document. On the Internet, classification systems (e.g., DESIRE) often provide an address or hyperlink to the relevant site. Researchers into artificial intelligence have begun to create ontologies (i.e., classification systems) for real-world knowledge so computers can represent contexts, understand human languages, and recognize how things in the world are related to each other.

Most classification systems have a hierarchical structure in which the attributes of a class on a higher level are shared by those on the lower levels. For example, a document about Canadian history in general will not be as detailed on each of its constituent topics (e.g., the Canadian constitution) as a document that deals only with that topic, but a document about the narrower topic will also contain elements of the broader topic. For example, a document about the Canadian constitution will also deal to some extent with Canadian history in general. Unlike subject-heading systems and thesauri, classification systems are displayed structurally, not as an alphabetical list. Each class has a notation that represents the place of the class in the world of knowledge and in the system and that shows its relationships to a hierarchy of other classes. For example, part of the Dewey Decimal Classification schedules for "technology" (with growing specificity) is 600 for technology (applied sciences), 630 for agriculture and related technologies, 636 for animal husbandry, 636.7 for dogs, and 636.8 for cats.

Notation can be numeric, alphabetical, or mixed alphanumeric. For example, the notation for the topic "economics of education" is 338.4337 in the Dewey Decimal Classification and LC65 in the Library of Congress Classification. Hierarchical relationships may also be shown in the notation. For example, in the Dewey Decimal Classification, "Canadian history" is notated as 971, where the 9 stands for "history," the 7 stands for "North America," and the 1 stands for "Canada." The Dewey Decimal Classification notation 971 thus shows that history is a broader concept than North America and that North America is a broader concept than Canada.
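
The way a hierarchical notation carries its own broader classes can be sketched in Python. The sketch assumes that each significant digit of a Dewey-style notation adds one level of specificity, which holds for the examples in this article but is a simplification of the full schedules; the function names are hypothetical, and the captions are those quoted above.

```python
# Captions taken from the examples in this article.
CAPTIONS = {
    "600": "technology (applied sciences)",
    "630": "agriculture and related technologies",
    "636": "animal husbandry",
    "636.7": "dogs",
    "636.8": "cats",
}


def format_notation(digits):
    """Pad to three digits and place a decimal point after the third."""
    padded = digits.ljust(3, "0")
    return padded if len(padded) == 3 else padded[:3] + "." + padded[3:]


def broader_notations(notation):
    """Derive broader classes by dropping significant digits one at a time."""
    digits = notation.replace(".", "").rstrip("0")
    chain = []
    while len(digits) > 1:
        digits = digits[:-1].rstrip("0")
        if digits:
            chain.append(format_notation(digits))
    return chain


def hierarchy(notation):
    """Return (notation, caption) pairs from the broadest class to the narrowest."""
    path = list(reversed(broader_notations(notation))) + [notation]
    return [(n, CAPTIONS.get(n, "")) for n in path]
```

Under this assumption, the broader classes of 636.7 are 636, 630, and 600, and those of 971 are 970 and 900.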

One relatively recent development in the creation of classification systems is the construction of faceted systems. Facet theory was developed by Shiyali R. Ranganathan in India and refined in his Colon Classification (1964). Facet analysis divides a subject field into mutually exclusive groups called "facets" and then divides each facet into its constituents. For example, the material facet for furniture would contain terms for the various kinds of materials from which furniture can be made (e.g., wood, metal, cloth, plastic). Each of these terms has its own notation, and notations from different facets can be synthesized to express a complex topic. For example, one might express the topic "red plastic tables" with notational elements from the color, material, and type facets. The idea of facet analysis has also been adopted for the development of thesauri. Its advantage is that all topics do not have to be listed, and a notational subject statement may be built up in a way that is similar to constructing a sentence from component words in a natural language. Another faceted classification system is the Bliss Bibliographic Classification (devised by Henry Evelyn Bliss and edited by Jack Mills and Vanda Broughton), which is based on Ranganathan's theories and incorporates other advances from modern classification research.
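
Notational synthesis can be illustrated with a short Python sketch. The facets follow the furniture example above, but the notations, the combining order, and the colon used as a connecting symbol are invented for this illustration and are not taken from the Colon Classification or the Bliss Bibliographic Classification.

```python
# Hypothetical facets for furniture, each term with its own notation.
FACETS = {
    "type": {"table": "T1", "chair": "T2"},
    "material": {"wood": "M1", "metal": "M2", "cloth": "M3", "plastic": "M4"},
    "color": {"red": "C1", "blue": "C2"},
}

# Hypothetical fixed order in which facets are combined.
CITATION_ORDER = ("type", "material", "color")


def synthesize(**chosen):
    """Build a compound notation from one chosen term per facet."""
    parts = []
    for facet in CITATION_ORDER:
        if facet in chosen:
            parts.append(FACETS[facet][chosen[facet]])
    return ":".join(parts)


red_plastic_tables = synthesize(color="red", material="plastic", type="table")
```

The eight terms listed in these three facets can express sixteen different type-material-color combinations without any of the combinations being listed in advance, which is the economy that facet analysis offers.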

The ability to search a database using notations as search terms means that the searcher does not have to know the human language that is used in the records. For example, using the Dewey Decimal Classification notation 636.8 ("cats") for searching a database in which each record has been assigned one or more notations will retrieve records in English, Spanish, Chinese, Russian, or any other natural language. The searcher does not have to know the word for "cats" and its synonyms in all these languages. This ability is particularly useful in multilingual information environments.
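
A minimal Python sketch shows why a notation works as a language-independent search term. The records below are invented, and each is assumed to carry one or more Dewey Decimal Classification notations.

```python
# Invented records whose titles are in different natural languages.
RECORDS = [
    {"title": "Cats", "language": "English", "notations": ["636.8"]},
    {"title": "Gatos", "language": "Spanish", "notations": ["636.8"]},
    {"title": "猫", "language": "Chinese", "notations": ["636.8"]},
    {"title": "Кошки", "language": "Russian", "notations": ["636.8"]},
    {"title": "Dogs", "language": "English", "notations": ["636.7"]},
]


def search_by_notation(records, notation):
    """Retrieve every record to which the notation has been assigned."""
    return [record for record in records if notation in record["notations"]]


cat_records = search_by_notation(RECORDS, "636.8")
cat_languages = [record["language"] for record in cat_records]
```

The single search term 636.8 retrieves the English, Spanish, Chinese, and Russian records, although the searcher never typed a word in any of those languages.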

Nonstandardized Methods

Nonstandardized methods of knowledge organization have been developed and are used for accessing the content of an individual document. An abstract is a brief summary that contains only the most salient points from the document and is often written by a professional abstractor, not by the originator of the document. Abstracts are often included at the beginning of a journal article and, in an electronic environment, these abstracts can be searched to find words in uncontrolled vocabulary that are of interest to the searcher. Individual documents such as books often have an index that refers only to that document and its page numbers. These back-of-the-book indexes are created by professional indexers, and no standardized method has been developed. Each book also has a table of contents that includes the names of chapters and/or sections in order to help readers find what they want. In the case of both abstracts and back-of-the-book indexes, searching with an uncontrolled vocabulary means that one can never be certain that all the relevant material has been retrieved or that the search has been comprehensive.

Producing Files in a Standardized Format

Individual surrogate records are entered into a file to create a manual or computerized catalog, list, directory, index, guide, or register that can be searched. In a manual (i.e., printed) file, the display format is usually established by a publisher (e.g., for a book) or by an institution (e.g., for a library catalog). For computerized resources (e.g., a database), information is encoded from descriptive standards such as AACR2, and the way this information is displayed can be customized. To encode information means to make it machine-readable. Institutions or individuals that want to exchange records can do so if they are using the same encoding standard or if a method has been developed to convert one standard format to another. Sharing records increases their accessibility for people who are trying to find information. Standardized encoding formats include, for example, Machine-Readable Cataloging (MARC) and Standard Generalized Markup Language (SGML), which allow data to be displayed in human languages. The MARC format is the oldest encoding standard and is used in many libraries. Markup languages such as SGML permit the structures of many different types of documents to be encoded. They show which elements are structural elements (e.g., a paragraph or a title) and which elements are content elements (e.g., the sentences in the paragraph). In addition, standards can be used to describe each other. For example, MARC records can be encoded with SGML.
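
The separation between encoding a record and displaying it can be sketched in Python. The sketch uses XML, a later and simpler relative of SGML, because Python's standard library can read and write it; the element names and the two display styles are hypothetical and do not follow MARC or any published document type.

```python
import xml.etree.ElementTree as ET


def encode(record):
    """Make a record machine-readable by wrapping each element in markup."""
    root = ET.Element("record")
    for name, value in record.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def decode(markup):
    """Recover the data elements from the encoded form."""
    return {child.tag: child.text for child in ET.fromstring(markup)}


def display_brief(record):
    """One customized display: title and creator on a single line."""
    return f"{record['title']} / {record['creator']}"


def display_labeled(record):
    """Another customized display: one labeled line per element."""
    return "\n".join(f"{name}: {value}" for name, value in record.items())


# Invented sample data.
original = {"title": "Gardening Handbook", "creator": "Doe, Jane"}
encoded = encode(original)
received = decode(encoded)
```

Two institutions that agree on the encoding can exchange the `encoded` string, and each can then choose its own display of the `received` record.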

Conclusion

Cataloging and knowledge organization systems have been developed to make it easier for people to find what they need within the complex worlds of information and knowledge. These systems are used in all kinds of information environments to improve access to actual and virtual documents in many formats, in many languages, and from many periods of history. The evolution of these systems is ongoing because information professionals are constantly striving to improve access for users of the systems.

Bibliography

Chan, Lois Mai. (1994). Cataloging and Classification: An Introduction, 2nd edition. New York: McGraw-Hill.

Cutter, Charles Ammi. (1904). Rules for a Dictionary Catalog, 4th edition. Washington, DC: U.S. Government Printing Office.

DESIRE Consortium. (2000). "Welcome to the DESIRE Project." <http://www.desire.org>.

Dublin Core Metadata Initiative. (2001). "Overview." <http://www.dublincore.org>.

Getty Research Institute. (2001). "Art & Architecture Thesaurus Browser." <http://www.getty.edu/research/tools/vocabulary/aat/>.

Hensen, Steven L., comp. (1989). Archives, Personal Papers, and Manuscripts: A Cataloging Manual for Archival Repositories, Historical Societies, and Manuscript Libraries, 2nd edition. Chicago: Society of American Archivists.

Joint Steering Committee for Revision of AACR. (1998). Anglo-American Cataloguing Rules, 2nd edition. Chicago: American Library Association.

Lancaster, Frederick W. (1998). Indexing and Abstracting in Theory and Practice, 2nd edition. Champaign: University of Illinois, Graduate School of Library and Information Science.

Library of Congress, Cataloging Distribution Service. (1914-). Library of Congress Subject Headings. Washington, DC: Library of Congress.

Library of Congress, Cataloging Policy and Support Office. (1902-). Library of Congress Classification. Washington, DC: Library of Congress.

Library of Congress, Network Development and MARC Standards Office. (2001). "Dublin Core/MARC/GILS Crosswalk." <http://www.loc.gov/marc/dccross.html>.

Library of Congress, Network Development and MARC Standards Office. (2001). "Encoded Archival Description." <http://www.loc.gov/ead/ead.html>.

McKiernan, Gerry. (2001). "Beyond Bookmarks: Schemes for Organizing the Web." <http://www.public.iastate.edu/~CYBERSTACKS/CTW.htm>.

Mills, Jack, and Broughton, Vanda, eds. (1977-). Bliss Bibliographic Classification, 2nd edition. London: Butterworths.

Milstead, Jessica L. (1984). Subject Access Systems: Alternatives in Design. Orlando, FL: Academic Press.

Mitchell, Joan S.; Beall, Julianne; Matthews, Winton E., Jr.; and New, Gregory R., eds. (1996). Dewey Decimal Classification and Relative Index, 21st edition. Albany, NY: Forest Press.

Ranganathan, Shiyali Ramamrita. (1964). Colon Classification, 6th edition. Bombay, India: Asia Publishing House.

Vickery, Brian C. (1997). "Ontologies." Journal of Information Science 23(4):277-286.

Clare Beghtol

Encyclopedia of Communication and Information