Retrieval of Information
Retrieval of Information
RETRIEVAL OF INFORMATION
Information Retrieval (IR), has been part of the world, in some form or other, since the advent of written communications more than five thousand years ago. IR has as its domain the collection, representation, indexing, storage, location, and retrieval of information-bearing objects. Traditionally these objects have been text-based documents such as articles, reports, and books; however, as multimedia computing has progressed, the list of information-bearing objects has grown to include such things as images, videos, maps, sound, and music recordings. In the modern sense of the term, IR has its roots in the scientific information explosion that accompanied, and followed, World War II. Predating the computer, early modern IR systems used ingenious manual mechanisms to deal with the millions of scientific and technological research papers that were being written as part of the war against tyranny.
The remarkable growth in computerization has lead to a "chicken-or-the-egg" scenario with regard to the growth in IR research and design. As computers become more powerful, more information is generated. As more information is generated, the need for bigger, better, and faster IR systems increases. This need in turn spurs the need for bigger, better, and faster computers, and so on, ad infinitum. Until the mid-1990s, interest in IR was limited to a handful of highly trained researchers, librarians, information scientists, computer scientists, and engineers. With the growing use of the Internet, however, information retrieval through the use of search engines has become an activity that is engaged in by a large portion of the general public. This newfound popularity is making IR research and development one of the hottest growth areas in the new digital "e-economy."
General Model of Information Retrieval
An IR system begins with a person who has some need for information. A student may need to find information for a class project on IR, for example. If there were just a little information available, then a purely manual approach might work. If the student knows that all of the needed information is available in a specific book or encyclopedia, and the student knows where that resource is, then he or she can just go get it. There is no need for any retrieval system beyond the rather remarkable one in the student's head. What happens, however, when the student does not know where to get the information? Perhaps an appropriate book exists in the library. Perhaps more is available in an online database or on a CD-ROM. Finally, perhaps some useful information may even be available on the World Wide Web. The problem is that in all of these cases the student cannot just wander around, hoping to find the needed information. These information collections are all much too large for that. Fortunately, for hundreds or thousands of years, people have been making retrieval tools to help with this task. Since the 1960s, many of these tools have involved computer applications. The following discussion will focus on these computer-based systems, with occasional reference to the earlier manual systems.
FIGURE 1. Information retrieval model.
A general IR system has the following eight components (each with a particular function): (1) a query interface, (2) a query processor, (3) an index,(4) an index engine, (5) one or more collections of information, (6) a result list, (7) a result display, and, most important, (8) a user or "searcher" of the system. The basic relationships among these elements of an IR system are shown in Figure 1.
The student, who might more generally be called the "user," interacts directly with the query interface and the result display. Users tell the IR system what they need through the query interface. A query is the string of words that users input to tell the computer what they need. The retrieval system interface provides a place for users to write the request. It would be easiest if users could speak a request in their native language. Unfortunately, this would require a special kind of artificial intelligence that has not yet been developed for computers. Users must tell the information system what they need in a language that the computer can understand. Usually, this is a language composed of the words that the user thinks will be in the documents that they are trying to find. In some systems, these words can just be presented in a list. Other systems fake natural language queries by allowing users to enter natural language word strings from which the system strips out all but the nouns. Most systems, however, require that the user connect the words together with the words "and," "or," or "not." This is called Boolean logic. So, the student searcher might look for information on the topic of "information retrieval" by typing "((information OR data) AND retrieval). "
While this language is simple, the computer still needs to figure out what to do with the string of words. The part of the system that does this is the query processor. The query processor breaks the query, (information OR data) AND retrieval, into a set of simple operations. The processor searches for "retrieval" in the collection first. If no documents are found with this word, the search may stop, since the query states that the document that the person is looking for must contain the word "retrieval" plus something else. If it cannot find "retrieval," then there is no point in looking for anything else. If documents are found that contain the word "retrieval," then the system will search to see if those documents also contain either the word "information" or the word "data." Those documents (if any) that do contain one of those words in addition to the word "retrieval" then become part of the retrieved set.
How does the system know which documents contain a word and which do not? While computers are fast, they are not fast enough to read all of the documents in a collection to see of a word is there every time a person asks. For example, if the World Wide Web were searched this way, the computer would need to read through millions of documents on millions of computers for each query. A better approach is to create an index file.
In full-text retrieval systems, an index file is an organized list of words. Each entry in the list contains a word and points to all of the documents that contain that word. It is something like a set of thousands of index cards. If they are unsorted, it would take a long time to find any one card, but if they are sorted into alphabetical order, then the search is much quicker. Once the correct entry is found in an index, it is relatively simple to read off the list of names of documents that contain the word.
There is a set of important issues related to the words that go into the index. This includes word selection and normalization. Indexing can be accomplished manually or by the computer. Computer indexing is generally faster but less precise, while manual indexing takes a long time, is generally of high quality, but is still fraught with problems such as inconsistency between indexers' choice of words. Vocabulary chosen by the indexers may not match that used by the individual searchers who are looking for information. Users must guess what terms are in the index, unless there is a term thesaurus.
An indexing engine is software that creates index files. It does this by processing the documents in the collection one by one. It takes each word from the document and tests to see if it should be included in the index. Some words are used so frequently in texts that they are not of much use for indexing. Words like "the" appear many times in all documents, so there is no point in including them in the index. The group of words that are to be ignored during indexing is called the "stop list." Sometimes the index engine performs additional processing on the words for the index. For example, the index engine may stem the words. Stemming is the process of removing endings from words that might otherwise tend to hide the similarity between words. For example, a person might be searching for "computing" but would be happy to match words such as "computer," "computers" and "computed." Stemming would reduce them all to "comput*." The indexing engine may perform other operations as well, such as converting all words to lowercase. After all of this processing, the index engine will make an entry in the index file for each word that it encounters in the documents and create a pointer back to the documents where the words occur.
Result Lists and Result Display
When a computer system combines all of the result lists of documents that match each of the search terms in the query, it displays a list of the names of the documents, or some sort of document surrogate, such as a brief description of the document. In traditional displays, the system will present a list that contains the document title, author, name of the publication, where the document came from, and in some systems, an abstract. This list must be presented in some order that is convenient for the user. Short lists can be presented in alphabetical order. Many result lists, however, particularly on the World Wide Web, tend to be very long, sometimes containing hundreds or even thousands of document names. Since only ten to twenty records at a time can appear on a computer screen, because of screen size limitations, this list may require the user to look though many screens to find a desired document.
Since the length of these lists makes alphabetical listings less than useful, it is typical in IR systems to present the list in an order that reflects how well the particular document matches the query. One simple way to do this is to count the number of words in the query that match the document. In the example, "((information OR data) AND retrieval)," some documents may match all three words while others may match only two words. Another ordering technique is to count the number of times the query words occur in the documents. For the sample query, if one document contains the word "retrieval" once and another contains the word "retrieval" three times, the document containing the word three times might be considered a better match and would be put closer to the front of the list.
An IR system searches for information in a collection of documents. These documents can take many forms. They might be individual articles from a scientific journal, newspaper stories, e-mail messages, poems, or encyclopedia entries. What all of these collections have in common is that they are composed of text.
The first and most important step in retrieving information is for the searcher to decide on what collection to search for the information. Collections are put together for different purposes. There are collections of news stories, collections for branches of science, collections of poems, and collections of web-pages. In short, there are collections for nearly every possible topic. No one index covers all of the topics. Collections may also exist in a variety of formats. Some collections are of music, sound, photography or video. The query example used above was based on a text-based collection. Usually, collections of sound or pictures are indexed using words that have been added to the documents to describe them. Alternatively, music can be indexed by its notes, sound by its dominant frequencies, images by their dominant colors or color layout, and video by the key images extracted from the motion. All of these indexing techniques are content-based methods. This contrasts with concept-based methods that are usually provided by humans and are more closely tied to the actual meaning of the document. In content-based methods, automatic computer programs are used to extract not the meaning but just some patterns from the documents. So, for example, a program must find the distribution of the most common colors in a photograph. These colors would be used to index the image. This, of course, changes the way in which a searcher must specify a query. It is difficult to type in a list of words that describes a photo-graph's distribution of color. Instead, some of these systems allow a person to search by drawing, by painting, or by selecting examples that are like the one that they want.
After the IR system selects a set of documents, the selections are displayed to the user. The format of the display varies according to what part of a document is displayed, which documents are given prominent positions, and how the full text is displayed if it is available. The common form of display is a list of matching document identifiers. In a magazine, journal, or newspaper collection, the document identifiers might be the author, article title, name of the publication, and the date of publication. Only a small number of document identifiers, usually about ten, can be displayed on a computer screen at one time. If there are more then ten items in the retrieval set, then there is usually a button or command that will display the next group of ten items. If the collection contains the full text for an item, the computer interface usually provides a way to bring the full text onto the computer screen or to print it.
Frequently, a user's first search does not produce the information that the user wanted, there is not enough information, or the list of results is too long. In each of these cases, the user may choose to reformulate the query, perhaps by using some new knowledge that was gained by looking at the results of the earlier search. This search and retrieval cycle continues until the user either finds the desired information or gives up. All too often, people do not find the information that they want. Even worse, they sometimes believe that they have found all of the relevant information when they have not. Many new systems are being developed with advanced features that will help remedy this situation.
Advanced Search Features
Each of the components of the IR system as discussed above can be improved, particularly the query interface, the query processor, the index, the collections, and the result display.
Natural language query systems allow people to use everyday language, rather than Boolean logic, to state what they are interested in finding. The Boolean query "(information OR data) AND retrieval" could be written as "give me all records about information or data retrieval." While this may look like a good alternative to Boolean queries, some caution is warranted. Computers have limited ways of processing human languages, so it is easy to be misled about what a computer will do with a statement like the one given above. The computer is likely to select only the nouns from the sentence, remove words like "record" and "document," and then "OR" the nouns together into "information OR data OR retrieval." A cleverly written program might figure out the correct use of the "and" and the "or" in this situation, but such proficiency is the exception rather than the rule for most computer-based IR systems.
Vocabulary aids can improve retrieval. It is sometimes difficult for users to guess what words were used by the author or the indexer to describe the topic that is of interested. Consider what would happen in the above example if the author used the words "database retrieval" and not "data retrieval." Neither the Boolean query processor nor a natural language search would be able to find the record if the searcher used the term "data retrieval" instead of the term "database retrieval." To help with this problem, some systems provide either a term thesaurus or a controlled vocabulary. A term thesaurus can be provided in a separate printed book, or it can be built into the retrieval system. A term thesaurus lists what words can be used in place of other words in a query. This type of thesaurus has entries such as "DATA STORAGE use DATABASE" to tell the searcher, in this case, to use the word "database" rather than "data" when performing a search. The thesaurus may also suggest broader and narrower words. For example, "DATABASE broader term is DATA MANAGEMENT" or "DATABASE narrower term is RELATIONAL DATABASE MANAGEMENT SYSTEM (RDBM)". A thesaurus may also be used for automatic query expansion. For example, when a person searches for "data," the system might automatically add the word "database" to the query without asking the person.
Another feature that is available in more advanced systems is relevance feedback, which allows users to incorporate previously retrieved documents into a query. Users mark the documents as "relevant" (i.e., similar to what they are wanting) or "irrelevant" (i.e., not at all what they are wanting). The query processor uses this information to form a new search for the user. The query processor might do this by adding keywords from the relevant documents to the query and adding "not" to the keywords from irrelevant documents.
When most information retrieval systems return a list of retrieved documents, the only way to indicate potentially "good" documents is to put them at the top of the list. Using computer graphics, some systems can give the person doing the search more information about why a particular document was returned. One such system is called Web VIBE. In this system, queries may be broken into parts at the "or" in Boolean queries. For example, one query might be "information AND retrieval," and at the same time, the user can search for "data AND retrieval." Each query is represented as a "magnet" on a different part of the screen. The documents that match the queries are represented on the computer screen as icons, such as small images of pieces of paper. The magnets attract icons that represent documents that match the query. If a document icon matches only one of the queries, the icon for that document will fall directly on the magnet. The more interesting case is when a document matches both queries. In that case, the magnets that represent the queries have a "tug-of-war" over the document icon, and it will end up on the computer screen somewhere between the two magnets. The searcher can see which parts of a query best match a document by the position of the icons on the computer screen. Contrast this with the standard result list format, which simply lists relevance-sorted matches to the query "(information OR data) AND retrieval." In this case, the user cannot tell if a document was returned because it matched "information," "data," or both "information" and "data."
Evaluation of Information Retrieval Systems
Information scientists are continually arriving at new ways to improve IR, requiring the evaluation of IR systems to become an important component of IR research and development. In order that the comparisons be as consistent as possible, IR researchers have developed the "Cranfield Model" of evaluation. The Cranfield Model is named after a series of experimental IR evaluations performed by Cyril Cleverdon and his colleagues in Cranfield, England, during the 1960s. The second of the Cranfield studies (Cranfield II) has been called "the exemplar for experimental evaluation of information retrieval systems" (Harter and Hert, 1997). Thus, the Cranfield experiments are considered by most IR researchers to be the progenitor of the discipline of IR evaluation. The Cranfield Model can be described as
- a test collection of documents,
- a set of queries, and
- a set of relevance judgements (i.e., lists, for each query, of all of the relevant documents in the collection).
For a given query and act of retrieval, the following document sets and associated numbers are known:
- relevant and retrieved documents (RelRet),
- nonrelevant and retrieved documents (NRelRet),
- relevant and nonretrieved documents (Rel Nret), and
- nonrelevant and nonretrieved documents (Nrel Nret).
Ideally, for any given query, an IR system should find for the user all relevant, and only relevant, documents. To determine how close to this ideal an IR system is working, a collection of IR evaluation measures, or metrics, have been developed. Two metrics, recall and precision, are the most commonly used measures. Recall evaluates, for any given query, how close an IR system comes to retrieving all the relevant documents in its database. Precision evaluates, for any given query, how close an IR system comes to presenting to the user only relevant documents. Recall and precision are both expressed either as a number from 0 to 1, or as a percentage. A score of 0 (0%) represents complete failure, while a score of 1 (100%) represents perfection. An ideal system would, therefore, have a recall score of 1 (100%) and a precision score of 1 (100%) for each and every query submitted. The recall and precision metrics can be expressed as easy-to-calculate equations:
Recall= RelRet/(Relret + RelNret)= A/(A+C)
Precision= RelRet/(RelRet + NrelRet)= A/(A+B)
Imagine that a test collection contains 100 documents that are relevant to the query "What is information retrieval?" Further imagine that when this query is submitted to the system, 25 documents are retrieved. After examining all 25 retrieved items, it is determine that 20 of them are relevant to the original query. In this case, the recall for the search is 0.2 (20%) and the precision is 0.8 (80%). These scores were calculated:
Recall= RelRet/(RelRet + RelNret)= 20/100=
Precision= RelRet/(Relret + NrelRet)+ A/(A+B)
= 20/25= 80%
The spirit of the Cranfield evaluations lives on in the form of the Text REtrieval Conferences (TREC), which started in 1992 as a joint IR evaluation project between the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA). The TREC evaluations are famous for the depth and breadth of the assigned retrieval tasks on which the entered IR systems are evaluated.
The evaluation of IR system performance is more of an art than a true science. The accurate calculation of recall scores is just one of the many problems that IR evaluators must overcome in their quest to be scientific. Note how the calculation of recall is predicated on knowing exactly how many relevant documents there are in the database for each query. When evaluators use a test collection, this is not a problem because previous researchers have examined every document in the collection to determine whether a given document is relevant to a given query. In real-world IR systems (e.g., AltaVista), systems with millions of documents, the examination of each document to determine its relevance to a query is impossible. Various techniques have been used to estimate the number of relevant documents that are present in a collection. For example, in their evaluation of a real-world legal collection, David Blair and M. E. Maron (1985) used a method of iterative searching to uncover potentially relevant documents that had initially been missed by a set of legal researchers. Using this technique, they revealed two important facts. First, users consistently overestimate the recall scores of their initial searches. Second, recall scores, in general, were shockingly low. Because the Blair and Maron approach is time-consuming and labor-intensive, other researchers have attempted to estimate the number of relevant documents by randomly sampling unretrieved documents. Still others have compared results of a series of similar searches and used the documents that were found to be in common as the basis for their estimates. Notwithstanding the estimation technique that is employed, it is important to stress that these methods provide only rough estimates of the actual number of relevant documents in a collection. Because real-world recall scores are based on such estimates, they too are considered mere estimations.
Since IR systems are developed by, and used by, humans, all of the problems associated with measuring the myriad variety of human behaviors also apply to IR evaluation. Questions of experimental validity are often raised. Are evaluators really measuring system performance, or are they measuring the IR skills of the users? Human users come to a system with a wide range of skills and abilities that must be accounted for if one is to determine whether the observed retrieval scores are the result of system performance or user skill. Experimental reliability is another problem. How well does the experimental evaluation of a system predict its behavior in the "real world"? It is one thing to claim that a system works well under laboratory conditions, but it is quite another to claim that it will work just as well once an IR system has been released to the public for general use.
A classic example of the reliability issue is the "recall-precision tradeoff" problem. Some users, such as doctors, lawyers, and graduate students, often need to conduct exhaustive searches (e.g., "I need all legal decisions handed down concerning malpractice lawsuits."). Exhaustive searches are those where each and every relevant document must be retrieved and then examined to ensure that no important information has been overlooked. Obviously, an IR system would need to have perfect recall to satisfy such users. Other users (even doctors, lawyers, and graduate students) occasionally need to conduct question-answering searches (e.g., "What year was Roe v. Wade decided?"). Specific searches, as this type of quest is sometimes called, require only that a small set of documents be retrieved. So long as one or more of the documents contains the desired answer, the size of the retrieved set is immaterial because once an answer-bearing document is retrieved and the answer is found, the remaining documents are no longer needed. Under this scenario, users with question-answering searches would be best served by an IR system with high precision. In a perfect world, the ideal IR system would be able to satisfy both types of uses simultaneously. Research has consistently shown that when users of IR systems aim for high recall, precision suffers significantly; similarly, when precision is the goal, recall performance suffers. This inverse relationship between recall and precision is the recall-precision tradeoff.
Since system evaluators cannot know all the different types of queries that an IR system might be called on to address, some researchers have questioned the meaningfulness of recall and precision scores as measures of true IR system performance. For example, how meaningful is a low recall score when a user only needs one document to answer a query? To overcome the problems posed by the recall-precision tradeoff, some IR researchers have proposed the adoption of special metrics that combine aspects of both recall and precision simultaneously.
Another difficulty that plagues the scientific examination of IR systems is what is meant by the term "relevance." Most researchers agree that relevant documents are those that somehow "satisfy" the query that is presented by a user. The debate hinges on the idea of "satisfaction." For some, the question is black and white—either the documents in question answer the query, or they do not. For others, satisfaction comes in shades of gray, with some documents being more relevant than others. Into this contentious mix is often thrown the notion of "pertinence," or the usefulness of the documents irrespective of relevance. Pertinent documents are those that a user finds useful. Pertinence and relevance are closely related, but they are not necessarily the same. Think of pertinence, like beauty, as being in the eye of the beholder. For example, a document retrieved about astrophysics might be relevant in response to a student's query about solar eclipses; however, the same document might also be nonpertinent to the student because the complex mathematical formulas that are presented within the document make the text incomprehensible. This debate over the nature of relevance might appear esoteric to an outsider, but it is important to IR researchers. Note the central role played by the counting of relevant documents in the calculation of both recall and precision. If one researcher decides to calculate precision by using the black-and-white interpretation of relevance and another decides to use pertinence, two very different sets of information will be generated. Since these two sets of results purport to evaluate the same thing, it is obvious how the relevance problem can lead to considerable confusion and uncertainty.
In reaction to the many problems that are associated with the accurate measurement of recall and precision, a new group of researchers has emerged, and they are called the "user-centered" school. User-centered IR evaluators believe that the Cranfield style of evaluation, along with its reliance on recall and precision, is fundamentally flawed because it is too "system centered" and thus does not capture the most important aspect of the interaction between users and IR systems, namely the successful informing of the users. Under the user-centered theory, an IR system is only successful if it assists users in moving forward with their lives by quickly, and with minimum of effort, fulfilling their broadly defined information needs. User-centered researchers perform qualitative assessments of achievement using a mixture of personal interviews, focus groups, surveys, and observations, and they eschew the quantitative measures of recall and precision.
Multimedia Information Retrieval
If one thinks of the development of modern computerized IR systems as forming a family tree, then in some sense the early online public access catalogues (OPACs) that are found in libraries would be the grandparents to all subsequent systems. Their children, nicely matured, would be the text-based search engines that are now found just about everywhere, including the World Wide Web, CD-ROM workstations, and palmtop personal digital assistants (PDAs). The teenaged grandchildren, full of promise but not quite ready to leave home, would be the visual information retrieval (VIR) systems that are designed to provide access to image and video information. To complete this family tree, the youngest grandchildren, just barely toddlers, would be the music information retrieval (MIR) systems and their cousins, the auditory information retrieval (AIR) systems. MIR provides an illustrative example of the challenges that are faced by multimedia information retrieval (MMIR) as it grows to maturity.
Interest in MMIR is growing rapidly. The advent of powerful multimedia-capable home computers and the point-and-click ease of the Internet have combined to create a huge demand for nontext information, including pictures, maps, speech, video, and music recordings. For example, according to Word Spot (2000), an Internet consulting company that tracks queries submitted to Internet search engines, the search for music—specifically, the now-popular MP3 format—has surpassed the traditional search for sex-related materials as the most popular search engine request.
Why is the obvious demand for music and other multimedia information not being met by IR research and development? Why are the "MP3 search engines" merely indexing the textual labels supplied by the creators of the files and not the music itself? MMIR is fraught with difficult problems that must be overcome before the technology can mature. First, multimedia information objects (e.g., images and music recordings) tend to be large. A single minute of CD-quality music takes up about 10 MB. These large sizes make multimedia information difficult to transmit, process, and store. Second, and more problematic, multimedia information is multifaceted (i.e., it has many different components that together make up the information). Music information can be said to comprise a complex array of pitch (e.g., notes), temporal (e.g., rhythm, meter), harmonic (e.g., chords), timbral (e.g., tone color), editorial (e.g., performance instructions), textual (e.g., lyrics), and bibliographic (e.g., composer, publisher) information, which together all form a coherent whole. The fact that the facets themselves have many different ways of being represented compounds this complexity to dizzying heights. For example, the pitch facet can be represented by note name (e.g., e-flat, g-sharp), interval (i.e., distance between notes), solfège (e.g., do, re, mi, fa, sol, la, ti), and sonic frequency, to name but a few options. MIR and other MMIR researchers must find comprehensive methods of dealing with this awe-inspiring complexity before MMIR systems can achieve the success of their text-based ancestors.
The goal of every MMIR system is to provide the user with the ability to search for and then retrieve multimedia objects using the medium of the desired objects. Thus, in the case of music information, MIR systems are being designed so music queries can be posed musically. Music-specific query methods that are being developed and refined include singing, humming, notation (i.e., traditional note and staff), pitch names, intervals, melodic contour (i.e., the overall shape of a melody), music keyboard, and so on. Each of these methods has its strengths and weaknesses. For example, contour queries are forgiving of user errors because they only have to remember the general shape of a melody, rather than the exact pitches. This approach, however, can result in retrieval of many unwanted songs (i.e., high recall but low precision). Using a music keyboard to input the exact pitches does offer the user the opportunity to achieve more precise results, but many users are not trained to play a music keyboard, so the chance for error is greatly enhanced. It is interesting to note that the recall-precision tradeoff of traditional text-based IR appears to have been handed down in full measure to its MMIR descendants.
Blair, David C., and Maron, M. E. (1985). "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System." Communications of the ACM 28(3):289-299.
Harter, Stephen P., and Hert, Carol A. (1997). "Evaluation of Information Retrieval Systems: Approaches, Issues, and Methods." Annual Review of Information Science and Technology 32:3-91.
Korfhage, Robert. R. (1997). Information Storage and Retrieval. New York: Wiley.
Pollit, A. S. (1989). Information Storage and Retrieval Systems: Origin, Development, and Applications. New York: Wiley.
Salton, Gerry, and McGill, Mike J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.
WordSpot. (2000). "Top 500 Custom Pick Report Example: MP3."<http://www.wordspot.com/samplecustom.html>.
P. Bryan Heidorn
J. Stephen Downie