Transaction-Generated Information and Data Mining

The term transactional information was first employed by David Burnham (1983) to describe a new category of information produced by tracking and recording individual interactions with computer systems. Unlike most human interactions, those processed by computer systems are easily recorded and aggregated to yield knowledge about individual behaviors that would otherwise have been more difficult to acquire and often less complete. Known as transaction-generated information (TGI), it is information acquired from commercial and noncommercial transactions involving individuals in many increasingly computerized day-to-day activities. Examples of commercial transactions include withdrawing money from an ATM or shopping with a credit card; examples of noncommercial transactions include checking books out of a library or participating in an online educational program. TGI can be contrasted with, but does not exclude, more traditional information such as a person's age, place of birth, education, and work history.

The Special Character of TGI

The practice of collecting information about persons is hardly new. Governments have collected census data since the Roman era. Until well into the twentieth century, however, the few records that existed about individuals contained information about when and where they were born, married, worked, or owned property. Information about the day-to-day transactions of individuals was rarely, if ever, collected and stored. Even if it had been collected, it would have been difficult to process and store. Armies of clerks would have been needed to sort through this information, and huge warehouses or repositories would have been required to store the physical records. Those conditions changed with the advent of computers and electronic databases.

Additionally, much traditional information about persons is gathered in ways that require conscious acts of disclosure on the part of those providing it. When individuals fill out census forms, they are generally aware of providing information about themselves to a government agency. By contrast, with TGI, data subjects are not always consciously aware that they are providing information about themselves to a data collector. When motorists use the convenience of an Intelligent Vehicle Highway System, such as E-ZPass, they seldom realize that a transaction occurs each time they pass a toll plaza. Not only is a motorist's prepaid E-ZPass account debited, but the exact time of passing through the toll booth is electronically recorded and stored.


Cookies

Next consider a kind of on-line transaction involving typical Internet users, who may have no knowledge that TGI is being collected. Via small data files called cookies, TGI is routinely gathered about users who visit web sites. Cookies technology enables web site owners to collect certain kinds of data about users who access their sites, including information about the user's Internet Protocol (IP) address and Internet Service Provider (ISP). This information is stored in a text file placed on the hard drive of the user's computer and then retrieved from that computer and resubmitted to the web site the next time the user accesses it. It provides the operator of a web site with information about a user's on-line browsing preferences. Transactions involving the use of cookies to exchange data between users and web sites typically occur without the knowledge and consent of users.
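
The exchange described above can be sketched in a few lines of Python using the standard library's http.cookies module. This is a minimal illustration, not a description of any particular web site's practice: the cookie name visitor_id, the value abc123, and the one-year lifetime are all assumptions chosen for the example.

```python
from http.cookies import SimpleCookie
from typing import Optional


def issue_cookie(visitor_id: str) -> str:
    """Web site side: build the Set-Cookie header value sent on a first visit."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = visitor_id          # hypothetical cookie name
    cookie["visitor_id"]["path"] = "/"
    cookie["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # assumed one-year lifetime
    return cookie["visitor_id"].OutputString()


def read_cookie(cookie_header: str) -> Optional[str]:
    """Web site side: recover the identifier the browser resubmits on a later visit."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None


# First visit: the site sends the cookie; the browser stores the name=value pair.
set_cookie_value = issue_cookie("abc123")
# Later visit: the browser returns only the name=value pair in its Cookie header.
returned_header = set_cookie_value.split(";")[0]
recognized = read_cookie(returned_header)
```

Once the identifier comes back, the site can link the new visit to whatever it recorded about earlier visits, which is how a browsing profile accumulates without any explicit act of disclosure by the user.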

Since its introduction on the web in the 1990s, cookies technology has been controversial. The owners and operators of on-line businesses and web sites, who defend the use of cookies, claim that they are performing a service for repeat users of a web site by customizing a user's means of information retrieval. For example, they point out that cookies technology enables them to provide a user with a list of preferences for future visits to that web site. Defenders of cookies also note that users can elect to disable cookies via an option provided on their web browsers.

Privacy advocates, on the other hand, argue that because cookies technology involves the monitoring and recording of an individual's activities while visiting a web site, as well as the subsequent downloading of that information onto a user's PC (without informing the user), the use of cookies clearly crosses the privacy line. They also point out that many web sites do not permit users to disable cookies, and they note that users must first be aware of cookies before they can opt out (i.e., reject cookies) on web sites that allow them to do so. Some privacy advocates also worry that information gathered about a user via cookies can eventually be acquired by on-line advertising agencies, which could then target that user for on-line ads.

Merging and Mining TGI

Because TGI exists in the form of electronic records, it can be easily exchanged between databases in a computer network; these records can also be merged. Computerized merging is the technique of extracting information from records about individuals (or groups of individuals) that reside in two or more databases, which are often unrelated, and then integrating that information into a composite file.
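
A minimal sketch of computerized merging follows. It assumes, purely for illustration, that both databases identify individuals by a shared person_id field; the library and toll records shown are invented, and real merging often has to match records on less reliable fields such as name and address.

```python
def merge_records(*databases):
    """Integrate per-person records from several databases into composite files."""
    composite = {}
    for database in databases:
        for record in database:
            key = record["person_id"]  # assumed shared identifier
            profile = composite.setdefault(key, {})
            for field, value in record.items():
                # Keep the first value seen for a field; later sources only add new fields.
                profile.setdefault(field, value)
    return composite


# Two unrelated, hypothetical databases holding records about the same person.
library_db = [
    {"person_id": "p1", "name": "A. Reader", "last_loan": "Privacy Law Primer"},
]
toll_db = [
    {"person_id": "p1", "toll_plaza": "Exit 14", "passed_at": "2004-03-01T07:42"},
    {"person_id": "p2", "toll_plaza": "Exit 9", "passed_at": "2004-03-01T08:15"},
]

merged = merge_records(library_db, toll_db)
```

The composite file for p1 now combines reading habits with travel times, information that neither database held on its own.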

Information gathered about an individual's on-line activities and preferences via Internet cookies can also be merged with information about an individual's transactions in off-line activities in physical space to construct a general profile. In 1999 DoubleClick.com, an on-line advertising firm that used cookies technology to amass information about Internet users, proposed to purchase Abacus, an off-line database company. DoubleClick's pending acquisition of Abacus was criticized by many privacy advocates, who feared that the on-line ad company would combine the information it had already acquired about Internet users (via cookies) with the records of some of those same individuals that resided in the Abacus database.

DoubleClick would have been able to merge web profiles with off-line transactional data about consumers. In January 2000, however, DoubleClick was sued by a woman who complained that the company had violated her right to privacy. The woman filing the suit claimed that DoubleClick's business practices were deceptive because the company had quietly reversed an earlier policy in which it provided only anonymous data about Internet users (acquired from cookies files) to businesses. Because of public pressure, DoubleClick backed off its proposal to purchase Abacus. However, because of the controversy surrounding the DoubleClick incident, many realized for the first time the kinds of privacy threats that can result from the merging of electronic data. And even though the DoubleClick-Abacus merger did not materialize, the danger of future mergers of this type remains.

In addition to being merged, TGI can also be mined. Data mining is a computerized technique used to reveal non-obvious patterns in data that otherwise would not be discernible. Data-mining technology also generates new classifications or categories (of individuals), which are not always obvious to the individuals who populate them. Some of these newly discovered or created categories suggest new facts about the individuals who constitute them. For example, a young executive with an impeccable credit history could, as a result of data-mining technology, end up being identified as a member of a (newly generated) category of individuals who are perceived to be high credit risks because of certain patterns found in aggregated data, despite the fact that the particular person's credit history is unblemished. That is, a data-mining program might associate the young executive with a group of individuals who are likely to start their own businesses in the next three years and then file for bankruptcy within the next five years.
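
The young-executive example can be made concrete with a deliberately simplified sketch. Real data-mining systems use far more sophisticated pattern discovery; here a plain group-level bankruptcy rate stands in for the mined pattern, and the history records, the grouping fields, and the 0.5 threshold are all invented assumptions.

```python
from collections import defaultdict

# Toy, invented history: (age_band, occupation, later_filed_bankruptcy)
history = [
    ("25-34", "executive", True),
    ("25-34", "executive", True),
    ("25-34", "executive", False),
    ("45-54", "executive", False),
    ("45-54", "executive", False),
    ("25-34", "teacher", False),
]


def group_risk(records):
    """Compute the share of each (age_band, occupation) group that went bankrupt."""
    counts = defaultdict(lambda: [0, 0])  # group -> [bankruptcies, total]
    for age_band, occupation, bankrupt in records:
        counts[(age_band, occupation)][0] += int(bankrupt)
        counts[(age_band, occupation)][1] += 1
    return {group: b / n for group, (b, n) in counts.items()}


def classify(person, risk_by_group, threshold=0.5):
    """Label a person by group membership alone, ignoring individual history."""
    group = (person["age_band"], person["occupation"])
    rate = risk_by_group.get(group, 0.0)
    return "high risk" if rate >= threshold else "standard"


# An applicant with an unblemished personal record.
applicant = {"age_band": "25-34", "occupation": "executive", "missed_payments": 0}
label = classify(applicant, group_risk(history))
```

The applicant's own missed_payments field never enters the decision. The label follows entirely from the aggregate behavior of others in the same generated category, which is the feature of data mining that the example in the text highlights.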

Because of concerns about the ways in which electronic records can be exchanged between two or more databases, various privacy laws have been enacted at the federal and state levels. For example, the Health Insurance Portability and Accountability Act (HIPAA) of 1996, whose privacy provisions took effect on April 14, 2003, provides protection for personal medical records. And the Video Privacy Protection Act (also known as the "Bork Bill" because it was passed by the U.S. Congress in the aftermath of Judge Robert Bork's nomination to the U.S. Supreme Court) protects consumers from having records of their video rentals collected and exchanged. However, these laws primarily aim at protecting personal information that is: (a) explicitly identifiable in electronic records, and (b) considered intimate or confidential.

Information acquired via data mining fits neither category. First, as noted, it is derived from implicit patterns in data, which, without data-mining technology, would not be accessible to data collectors. Second, the kind of personal information generated in the data-mining process is often considered non-intimate or non-confidential because it is derived from information acquired through transactions in which individuals engage openly and in public places.

The use of courtesy cards in supermarket transactions might initially seem innocuous from the perspective of personal privacy. The items purchased are typically transported in an open shopping cart that is visible to anyone in the store, so there is nothing confidential or intimate about the activity. However, a record of courtesy card purchases can be used to generate a consumer profile. This profile reveals patterns that identify, among other things, the kinds of items purchased and the time of day or week an individual typically shops. Such information is useful to information merchants, who use it to target consumers in their advertising and marketing campaigns. Furthermore, information in a consumer profile can be used to make judgments about personal lifestyles, health, spending habits, and more. Indeed, such a profile may be created even when the aggregated data on which it is based is inaccurate because the courtesy card was loaned to another person.
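
As a rough sketch of how such a profile might be derived, the following Python aggregates a handful of invented courtesy card records into the most frequently purchased items and the usual shopping slot. The record layout and the sample purchases are assumptions made for the example, not any retailer's actual format.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Invented courtesy card records: (timestamp of the store visit, item purchased)
purchases = [
    ("2004-03-01 18:05", "coffee"),
    ("2004-03-01 18:05", "cigarettes"),
    ("2004-03-08 18:20", "coffee"),
    ("2004-03-15 18:10", "antacid"),
]


def build_profile(rows):
    """Summarize what a cardholder buys and when the card is typically used."""
    items = Counter(item for _, item in rows)
    # Count each store visit once, even when several items were bought.
    visits = {datetime.strptime(stamp, "%Y-%m-%d %H:%M") for stamp, _ in rows}
    slots = Counter((WEEKDAYS[v.weekday()], v.hour) for v in visits)
    return {
        "top_items": items.most_common(2),
        "usual_slot": slots.most_common(1)[0][0],
    }


profile = build_profile(purchases)
```

Nothing in the records says who actually carried the card on a given visit, so the profile attaches to the cardholder regardless, which is the source of the inaccuracy noted above.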

The new forms of information produced by TGI and data mining thus present special challenges to privacy. First, individuals may not be aware of the degree to which their activities are being tracked across a constellation of computer systems and their interactions analyzed by data-mining techniques. This lack of knowledge is itself an ethical issue that deserves to be addressed through general education and through disclosure statements associated with the particular computer systems. Second, because it is easy for TGI and data-mining products to include inaccuracies that may have substantial, if subtle, impacts, it may be necessary to consider possibilities for personal review or disclosure when TGI is used to influence decision making.

HERMAN T. TAVANI

SEE ALSO Computer Ethics; Internet; Privacy.

BIBLIOGRAPHY

Burnham, David. (1983). The Rise of the Computer State. New York: Random House. One of the earliest accounts of how computer databases could be used by government agencies to store and exchange information gained about individuals from their electronic transactions.

Fulda, Joseph S. (2004). "Data Mining and Privacy." In Readings in CyberEthics, 2nd edition, ed. Richard A. Spinello and Herman T. Tavani. Sudbury, MA: Jones and Bartlett Publishers. This anthology comprises fifty readings; Fulda's article succinctly describes how privacy concerns involving data mining differ from other computer-related privacy issues.

Johnson, Deborah G. (2004). "Computer Ethics." In Academy and the Internet, ed. Helen Nissenbaum and Monroe E. Price. New York: Peter Lang. This anthology comprises twelve readings; Johnson's article provides an overview of computer-ethics issues, including a discussion of specific privacy concerns involving TGI.

Tavani, Herman T. (1999). "Informational Privacy, Data Mining, and the Internet." Ethics and Information Technology 1(2): 137–145. Examines how privacy issues arising from data mining differ from those associated with traditional data-retrieval techniques; also illustrates how mining data from the Internet differs from data mining in off-line contexts such as "data warehouses."

Tavani, Herman T. (2004). Ethics and Technology: Ethical Issues in an Age of Information and Communication Technology. Hoboken, NJ: John Wiley and Sons. Comprises eleven chapters that cover a wide range of computer-ethics issues, including an extensive discussion of data mining and cookies in Chapter 5: "Privacy and Cyberspace."