Data mining is the process of discovering potentially useful, interesting, and previously unknown patterns from a large collection of data. The process is similar to discovering ores buried deep underground and mining them to extract the metal. The term "knowledge discovery" is sometimes used to describe this process of converting data to information and then to knowledge.
Data, Information, and Knowledge
Data are any facts, numbers, or text that can be processed by a computer. Many organizations accumulate vast and growing amounts of data in a variety of formats and databases. These data may be loosely grouped into three categories: operational or transactional data, such as company sales, costs, inventory, payroll, and accounting; non-operational data, such as industry sales, forecast data, and macro-economic data; and metadata, which is data about the data themselves, such as elements related to a database's design or query protocol.
The patterns, associations, and relationships among all these data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when. Information can then be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items to combine with promotional efforts for the best sales or profit results.
Applications of Data Mining
Data mining is used today by companies with a strong consumer focus, such as retail, financial, communication, and marketing organizations. Data mining enables these companies to identify relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It enables them to determine what impact these relationships may have on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data and to find ways to apply this knowledge to improving business.
With data mining, a retailer can use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, retailers can develop products and promotions to appeal to specific customer segments. For example, Blockbuster Entertainment can mine its VHS/DVD rental history database to recommend rentals to individual customers, and American Express can suggest products to its cardholders based on an analysis of their monthly expenditures.
Data mining has many applications in science and medicine. Astronomers use data mining to identify quasars from terabytes of satellite data, as well as to identify stars in other galaxies. It can also be used to predict how a cancer patient will respond to radiation or other therapy. With more accurate predictions about the effectiveness of expensive medical treatment, the cost of health care can be reduced while the quality and effectiveness of treatment can be improved.
The data mining process is interactive and iterative, and many decisions are made by the user. Data mining is not an automatic process; it does not simply happen by pushing a button. It requires an understanding of the decision-maker's intentions and objectives, the nature and scope of the application, and the limitations of data mining methods. Data mining is research: a process that requires one to develop knowledge about the task at hand, to research possibilities and options, to apply the best data mining methods, and to communicate the results in a comprehensible form. Armed with solid information, researchers can apply their creativity and judgment to make better decisions and get better results. A variety of software systems are available today that handle the technical details so that people can focus on making the decisions. Most of these systems employ a variety of techniques that can be used in several combinations. Advanced techniques yield higher-quality information than simpler ones, automating the stages of information gathering to speed up decision-making and produce easily understood results.
Techniques for Data Mining
Just as a carpenter uses many tools to build a sturdy house, a good analyst employs more than one technique to transform data into information. Most data miners go beyond the basics of reporting and OLAP (On-Line Analytical Processing, also known as multi-dimensional reporting) to take a multi-method approach that includes a variety of advanced techniques. Some of these are statistical techniques while others are based on artificial intelligence (AI).
Cluster analysis is a data reduction technique that groups together either variables or cases based on similar data characteristics. This technique is useful for finding customer segments based on characteristics such as demographic and financial information or purchase behavior. For example, suppose a bank wants to find segments of customers based on the types of accounts they open. A cluster analysis may result in several groups of customers. The bank might then look for differences between the segments in the types of accounts opened and in behavior, especially attrition, and treat the segments differently based on these characteristics.
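As a minimal illustration of the idea, the sketch below runs a hand-rolled k-means clustering over invented customer records (two attributes per customer); the data, the two "segments," and all parameter choices are assumptions for the example, not real bank data.

```python
import numpy as np

# Hypothetical customer data: each row is (number of accounts, average balance in $1000s).
# Two artificial segments are generated so the expected grouping is known in advance.
rng = np.random.default_rng(0)
segment_a = rng.normal(loc=[2, 5], scale=0.5, size=(20, 2))    # few accounts, low balances
segment_b = rng.normal(loc=[10, 50], scale=0.5, size=(20, 2))  # many accounts, high balances
customers = np.vstack([segment_a, segment_b])

def kmeans(data, k, iterations=10):
    """Minimal k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = data[:: len(data) // k][:k].copy()   # simple spread-out initialization
    for _ in range(iterations):
        # distance of every point to every centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(customers, k=2)
```

With the segments this well separated, the recovered labels coincide with the generating segments; on real data the analyst would inspect the resulting groups as described above.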
Linear regression is a method that fits a straight line through data. If the line is upward sloping, it means that an independent variable such as the size of a sales force has a positive effect on a dependent variable such as revenue. If the line is downward sloping, there is a negative effect. The steeper the slope, the more effect the independent variable has on the dependent variable.
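The sales-force example can be made concrete with a few lines of code. The numbers below are invented for illustration; the line is fit by ordinary least squares.

```python
import numpy as np

# Hypothetical data: sales-force size (independent) vs. revenue in $M (dependent).
sales_force = np.array([5, 10, 15, 20, 25], dtype=float)
revenue = np.array([12.0, 21.0, 33.0, 39.0, 52.0])

# Fit a straight line revenue = slope * sales_force + intercept by least squares.
slope, intercept = np.polyfit(sales_force, revenue, deg=1)

# A positive slope means adding salespeople is associated with higher revenue;
# the steeper the slope, the stronger the effect.
```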
Correlation is a measure of the relationship between two variables. For example, a high correlation between purchases of certain products such as cheese and crackers indicates that these products are likely to be purchased together. Correlations may be either positive or negative. A positive correlation indicates that a high level of one variable will be accompanied by a high value of the correlated variable. A negative correlation indicates that a high level of one variable will be accompanied by a low value of the correlated variable.
Positive correlations are useful for finding products that tend to be purchased together. Negative correlations can be useful for diversifying across markets in a company's strategic portfolio. For example, an energy company might have interest in both natural gas and fuel oil since price changes and the degree of substitutability might have an impact on demand for one resource over the other. Correlation analysis can help a company develop a portfolio of markets in order to absorb such environmental changes in individual markets.
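Both cases can be checked with a correlation coefficient. The weekly sales figures below are invented to mirror the cheese-and-crackers and substitutable-markets examples above.

```python
import numpy as np

# Hypothetical weekly unit sales, invented for illustration.
cheese   = np.array([10, 12, 9, 15, 20, 18])
crackers = np.array([11, 13, 9, 16, 19, 17])   # tends to move with cheese
fuel_oil = np.array([20, 18, 21, 15, 11, 12])  # tends to move against cheese

# Pearson correlation coefficients, in [-1, 1].
r_pos = np.corrcoef(cheese, crackers)[0, 1]    # strongly positive: bought together
r_neg = np.corrcoef(cheese, fuel_oil)[0, 1]    # strongly negative: substitutes
```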
Factor analysis is a data reduction technique. This technique detects underlying factors, also called "latent variables," and provides models for these factors based on variables in the data. For example, suppose you have a market research survey that asks the importance of nine product attributes. Also suppose that you find three underlying factors. The variables that "load" highly on these factors can offer some insight about what these factors might be. For example, if three attributes such as technical support, customer service, and availability of training courses all load highly on one factor, we might call this factor "service." This technique can be very helpful in finding important underlying characteristics that might not be easily observed but which might be found as manifestations of variables that can be observed.
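A common first step toward factor extraction is a principal-component decomposition, which this sketch uses as a stand-in (classical factor analysis additionally models attribute-specific variance). The survey data is generated from two known latent factors, so the structure the analysis should recover is fixed by construction.

```python
import numpy as np

# Hypothetical survey: 100 respondents rate six product attributes.
# The ratings are generated from two latent factors, loosely "service" and "price."
rng = np.random.default_rng(1)
service = rng.normal(size=100)                 # latent factor 1
price = rng.normal(size=100)                   # latent factor 2
noise = lambda: rng.normal(scale=0.3, size=100)

ratings = np.column_stack([
    service + noise(), service + noise(), service + noise(),  # e.g. support, service, training
    price + noise(), price + noise(), price + noise(),        # e.g. cost, discounts, terms
])

# Principal-component extraction via SVD of the centered data.
centered = ratings - ratings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
variance_share = singular_values**2 / (singular_values**2).sum()
# The first two components should capture most of the variance,
# reflecting the two underlying factors.
```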
Another good application of factor analysis is to group together products based on similarity of buying patterns. Factor analysis can help a business locate opportunities for cross-selling and bundling. For example, factor analysis might indicate four distinct groups of products in a company. With these product groupings, a marketer can now design packages of products or attempt to cross-sell products to customers in each group who may not currently be purchasing other products in the product group.
Decision trees separate data into sets of rules that are likely to have different effects on a target variable. For example, we might want to find the characteristics of a person likely to respond to a direct mail piece. These characteristics can be translated into a set of rules. Imagine that you are responsible for a direct mail effort designed to sell a new investment service. To maximize your profits, you want to identify household segments that, based on previous promotions, are most likely to respond to a similar promotion. Typically, this is done by looking for combinations of demographic variables that best distinguish those households who responded to the previous promotion from those who did not.
This process gives important clues as to who will best respond to the new promotion and allows a company to maximize its direct marketing effectiveness by mailing only to those people who are most likely to respond, increasing overall response rates and increasing sales at the same time. Decision trees are also a good tool for analyzing attrition (churn), finding cross-selling opportunities, performing promotions analysis, analyzing credit risk or bankruptcy, and detecting fraud.
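A single tree-building step can be sketched as follows: among a few yes/no demographic questions, find the one whose split best separates responders from non-responders. The household records and feature names are invented; real trees repeat this step recursively on each branch.

```python
# Minimal decision-tree step: pick the split with the lowest weighted Gini impurity.
# Hypothetical household records: (owns_home, has_children, high_income, responded).
households = [
    (1, 0, 1, 1), (1, 1, 1, 1), (1, 0, 1, 1), (0, 0, 1, 1),
    (0, 1, 0, 0), (0, 0, 0, 0), (1, 1, 0, 0), (0, 1, 0, 0),
]
features = ["owns_home", "has_children", "high_income"]

def gini(rows):
    """Gini impurity of the 'responded' outcome within a group of rows."""
    if not rows:
        return 0.0
    p = sum(r[-1] for r in rows) / len(rows)
    return 2 * p * (1 - p)

def best_split(rows):
    """Return the feature whose yes/no split yields the purest branches."""
    scores = {}
    for i, name in enumerate(features):
        yes = [r for r in rows if r[i] == 1]
        no = [r for r in rows if r[i] == 0]
        scores[name] = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
    return min(scores, key=scores.get)

rule = best_split(households)   # the rule that best distinguishes responders
```

In this toy data, response tracks income exactly, so the selected rule corresponds to the kind of demographic characteristic a real direct-mail analysis would surface.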
Neural networks mimic the human brain and can "learn" from examples to find patterns in data or to classify data. The advantage is that it is not necessary to have any specific model in mind when running the analysis. Also, neural networks can find interaction effects (such as effects from the combination of age and gender) which must be explicitly specified in regression. The disadvantage is that it is harder to interpret the resultant model with its layers of weights and arcane transformations. Neural networks are therefore useful in predicting a target variable when the data are highly non-linear with interactions, but they are not very useful when these relationships in the data need to be explained. They are considered good tools for such applications as forecasting, credit scoring, response model scoring, and risk analysis.
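The interaction-effect point can be demonstrated with a tiny network learning XOR, a pattern defined entirely by the interaction of its two inputs. This is an illustrative sketch (layer sizes, learning rate, and epoch count are arbitrary choices), not a production training loop.

```python
import numpy as np

# XOR: the target depends only on the *combination* of inputs,
# which a straight line cannot capture but a hidden layer can.
rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4))   # input -> 4 hidden units
W2 = rng.normal(size=(4, 1))   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    hidden = sigmoid(X @ W1)
    return hidden, sigmoid(hidden @ W2)

_, out = forward(X)
initial_loss = float(((out - y) ** 2).mean())

for _ in range(3000):   # plain gradient descent on squared error
    hidden, out = forward(X)
    grad_out = (out - y) * out * (1 - out)                   # backprop through output sigmoid
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)  # backprop through hidden layer
    W2 -= hidden.T @ grad_out
    W1 -= X.T @ grad_hidden

_, out = forward(X)
final_loss = float(((out - y) ** 2).mean())
```

Note the interpretability trade-off described above: after training, the knowledge lives in the weight matrices W1 and W2, which do not translate into readable rules the way a decision tree does.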
Association models examine the extent to which values of one field depend on, or are predicted by, values of another field. Association discovery finds rules about items that appear together in an event such as a purchase transaction. The rules have user-stipulated support, confidence, and length. The rules find things that "go together." These models are often referred to as Market Basket Analysis when they are applied to retail industries to study the buying patterns of their customers.
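Support and confidence can be computed directly from a list of baskets. The transactions and the 60%/70% thresholds below are invented for illustration; real market-basket tools (e.g., the Apriori algorithm) do the same counting far more efficiently.

```python
from itertools import combinations

# Hypothetical purchase transactions (market baskets).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of all baskets containing every item in the set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction also containing the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Keep pairwise rules meeting user-stipulated support and confidence thresholds.
items = sorted(set().union(*baskets))
rules = [
    (a, c, support({a, c}), confidence({a}, {c}))
    for a, c in combinations(items, 2)
    if support({a, c}) >= 0.6 and confidence({a}, {c}) >= 0.7
]
```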
The Future of Data Mining
One of the key issues raised by data mining technology is not a business or technological one, but a social one: concern about individual privacy. Data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals' buying habits and preferences.
Another issue is that of data integrity. Clearly, data analysis can only be as good as the data that is being analyzed. A key implementation challenge is integrating conflicting or redundant data from different sources. For example, a bank may maintain credit card accounts on several different databases. The address (or even the name) of a single cardholder may be different in each. Software must translate data from one system to another and select the address most recently entered.
Finally, there is the issue of cost. While system hardware costs have dropped dramatically within the past five years, data mining and data warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the usefulness of the information being gleaned from the data, and the greater the pressure to increase the amount of data being collected and maintained. The result is increased pressure for faster, more powerful data mining queries. These more efficient data mining systems often cost more than their predecessors.
see also Database Management Software; Data Warehousing; E-commerce; Electronic Markets; Privacy.
Berthold, Michael, and David J. Hand, eds. Intelligent Data Analysis: An Introduction. Germany: Springer-Verlag, 1999.
Fayyad, Usama, et al. Advances in Knowledge Discovery and Data Mining. Cambridge, MA: MIT Press, 1996.
Han, Jiawei, and Micheline Kamber. Data Mining: Concepts and Techniques. San Diego, CA: Academic Press, 2001.
"DB2 Intelligent Miner for Data." IBM's Intelligent Miner. IBM web site. <http://www-4.ibm.com/software/data/iminer/fordata/about.html>
Hinke, Thomas H. "Knowledge Discovery and Data Mining Web References." Computer Science Department Web Site. University of Alabama Huntsville. <http://www.cs.uah.edu/~thinke/Mining/mineproj.html>
In the Information Age, it is not how much information a firm maintains but how that information is managed, manipulated, and exploited that can make or break the firm. Data mining is the practice of ferreting out useful knowledge from the wealth of information stored in computer systems, databases, communications records, financial and sales data, and other sources. A staple of the so-called Information Economy, data mining has evolved into a standard, and often requisite, business practice, often as valuable to firms as their underlying products or services. With competition heating up and firms making use of mountains of new information technology, those best able to exploit data mining for insights that feed a business model or strategy often hold a competitive edge.
Data mining combines expertise in data analysis with sophisticated pattern-searching software to crunch diverse mountains of data and churn out information designed to capture market share and boost profit margins. As the sheer wealth of information available escalated through the 1990s and early 2000s, such techniques assumed paramount importance. The focus of data mining is on organizing data and identifying patterns that translate into new understandings and viable predictions. Companies thus try to use data mining to discover relationships between data and phenomena that ordinary operations and routine analysis would otherwise overlook, and thereby identify squandered opportunity, redundancy, and waste.
Data mining combines features of various disciplines, particularly computer science, database management, and statistics, to map low-level data into more advanced and meaningful forms. In its truest form, data mining is part of the broader knowledge discovery from data (KDD) process, although the terms are often used interchangeably. KDD refers to the entire process of data warehousing, organization, cleansing, analysis, and interpretation. Colloquially, however, data mining stands for this entire process of deriving useful knowledge, using computational systems, from massive amounts of data.
Data-mining software systems are generally based on a combination of mathematical algorithms designed to seek out and organize information by variables and relationships. For instance, one common algorithm is called recursive partitioning regression (RPR). RPR processes all the variables chosen for a particular set of data and parses them for their explanatory power, that is, for the degree to which they account for variations in the data. In sifting through customer profiles, for example, the algorithm would isolate information such as personal incomes, education levels, sex, and so on.
The data-mining process is divided into three stages: data preparation, data processing, and data analysis. In the first stage, the data to be mined is selected and cleared of superfluous elements in order to streamline mining. In the second stage, the data is run through the algorithms at the heart of data mining, and characteristics and variables are identified and categorized, thereby transforming the data into broader, more meaningful pieces of information. In the final stage, the extracted information is analyzed for useful knowledge that can be applied to a business strategy.
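The three stages can be pictured with a toy walk-through. The transaction records and the choice of "highest-value customer" as the final piece of knowledge are invented purely to mark where each stage begins and ends.

```python
# Hypothetical raw transaction records.
raw = [
    {"customer": "A", "amount": 120.0},
    {"customer": "B", "amount": None},   # incomplete record
    {"customer": "A", "amount": 80.0},
    {"customer": "C", "amount": 40.0},
]

# Stage 1 -- preparation: clear superfluous or incomplete records.
prepared = [r for r in raw if r["amount"] is not None]

# Stage 2 -- processing: transform transactions into broader, more
# meaningful pieces of information (per-customer totals).
totals = {}
for r in prepared:
    totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]

# Stage 3 -- analysis: extract knowledge to apply to a business strategy,
# here simply the highest-value customer.
best_customer = max(totals, key=totals.get)
```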
Data-mining software was first developed in the late 1960s and 1970s as a way of tracking consumer-purchasing habits. Over the years, the application of data mining extended beyond retail to encompass larger-scale business practices, and was combined with advances in database management, artificial intelligence, computers, and telecommunications to constitute extremely powerful tools for knowledge extraction.
Traditionally, data mining was used primarily for categorized information; in other words, techniques and tools were designed to find relationships and patterns in masses of data that were already segmented into different categories via structured databases, such as a customer's age and residence. Later techniques greatly expanded the power of data mining by allowing for mining of unstructured text documents, such as e-mails, customer requests, and Web pages. In this way, data mining applies structure to loosely organized data, and highlights valuable information that might otherwise be missed. Moreover, this allows for the relevant extraction of information from documents that were assembled for any purpose, rather than specifically for the issue at hand, thereby increasing the efficiency of data flow and preventing the waste of potentially valuable information. This technique, known as text mining, creates a database of words that can be categorized and a sophisticated search engine to seek out those words and related alternatives.
Many times the first step toward data mining is building a data warehouse, or a vast electronic database to contain and organize the wealth of information collected. Without a data warehouse, companies lack the infrastructure to mine useful knowledge out of the data available. Like word processing programs and computer operating systems, data mining has grown more user-friendly and graphics-based as its application has spread throughout society to less technically inclined users. Software programs increasingly feature visualization techniques to dramatize specified data, relationships, and patterns.
Data mining has become a crucial component of customer management. The most common form of data mining begins with the accumulation of various kinds of customer profiles. These can take the form of simple names and addresses derived from other firms' customer lists and used for purposes of mass mailing, or they can constitute more sophisticated and comprehensive reports on consumer tastes and buying habits. Over time, firms amass great quantities of customer profiles through their own sales and through arrangements with other firms, and apply data-mining techniques to sift through them for clues as to how to adjust their strategies.
Whether to attract, service, or maintain customers, businesses position data mining as the cornerstone of customer relations. Using advanced data mining techniques, companies can determine what level of spending to expect from a particular customer, the range of his or her tastes, the customer's likelihood of churning, and a range of other information useful for customer relations. In these ways, companies are better able to assess the value of their individual customers and adjust their resources accordingly. More broadly, they can derive comprehensive information on demographic patterns, such as distinctions in purchasing patterns among age groups, income levels, and ethnic backgrounds, to discover additional retention and cross-selling possibilities. In this way they can segment their customer bases into specialized marketing focuses. By shifting outreach, advertising, and service resources to capitalize effectively on their diverse clientele, firms can realize cost savings, better conversion rates, and higher margins.
In the e-commerce world, data mining carries an additional range of benefits. In particular, as e-commerce merchants worked to create the maximum amount of value out of what the Web has to offer, they moved to personalize products and services. The extraction of personal information allowed by data mining greatly facilitated this process. By plugging data-mining analysis into customer-service databases and their Web applications, companies can tailor products and services to accord with individual customers' habits and preferences, thereby maximizing value.
Companies use such technology to mine data from within their own ranks as well. Company computer systems and intranets are increasingly searched as a method of retrieving information on key subjects that may have passed between employees at an earlier date via e-mail transmissions, word processor files, and Web page searches. In addition to harnessing the knowledge buried in these communications, sifting software can also be used to evaluate employees' strengths and weaknesses over a period of time for a comprehensive assessment of their performance. However, while such techniques are attractive to companies, they make privacy advocates nervous with their implications for the retrieval of personal communications and their possible review out of context.
The software industry responsible for data-mining programs was enjoying solid sales growth, which was expected to remain brisk in the early 2000s. The market research firm International Data Corp. estimated that the worldwide market for analytic application software would grow from $1.9 billion in 1999 to $5.2 billion in 2003, while the market for data mining applications specifically would grow from $343 million in 1999 to $1.4 billion in 2004.
Meanwhile, more and more of the world's leading businesses were implementing data mining into their core operations in one way or another. Companies may perform data mining and analysis internally or outsource the job to the growing number of data mining solutions providers. Forrester Research reported that the percentage of Fortune 1,000 firms that planned to incorporate data mining into their marketing strategies grew from 18 percent in 1999 to 52 percent in 2001. Forrester's findings also indicated that the most successful applications of data mining were realized by those firms that most thoroughly embedded data mining into their daily operations.
Cahlink, George. "Data Mining Taps the Trends." Government Executive, October, 2000.
Drew, James H., D.R. Mani, Andrew L. Betz, and Piew Datta. "Targeting Customers with Statistical and Data-Mining Techniques." Journal of Service Research, February, 2001.
Fielden, Tim. "Text-Mining Promises to Cull Answers from Random Text." InfoWorld, October 16, 2001.
Le Beau, Christina. "Mountains to Mine." American Demographics, August, 2000.
Lesser, Eric, David Mundel, and Charles Wiecha. "Managing Customer Knowledge." Journal of Business Strategy, November/December, 2000.
Liddy, Elizabeth D. "Text Mining." Bulletin of the American Society for Information Science, October/November, 2000.
Masi, C.G. "Data Mining Can Tame Mountains of Information." Research & Development, November, 2000.
Murphy, Victoria. "You've Got Expertise." Forbes, February 5, 2001.
Ruquest, Mark E. "Planning is Key to Exploiting Technical Data." National Underwriter, November 27, 2000.
Sullivan, Tom. "Picture This: Data Analysis Becomes More Graphic." InfoWorld, October 16, 2000.
SEE ALSO: Customer Relationship Management (CRM); Database Management
█ BRIAN HOYLE
Data mining refers to the statistical analysis techniques used to search through large amounts of data to discover trends or patterns.
Data mining is an especially powerful tool in the examination and analysis of huge databases. With the advent of the Internet, vast amounts of data are accumulating. Likewise, the amount of data that can be generated from a single scientific experiment, in which stretches of DNA are affixed to a glass chip, can be staggering. Visual inspection of the data is no longer sufficient to make a meaningful interpretation of the information; computer-driven solutions are required. For example, to analyze the DNA chip data, the discipline of bioinformatics—essentially a data mining exercise—emerged in the 1990s as a powerful melding of biology and computer science.
The collection of intelligence and the monitoring of the activities of a government or an organization also involves sifting through great amounts of data. Coded information can be inserted into data transmissions. If this information escapes detection, it can be used for undesirable purposes. The ability to extract the suspect information from the background of the other information is of tremendous benefit to security and intelligence agencies.
An example of data mining that is of relevance to espionage, intelligence and security is the use of computer programs—such as the Carnivore program of the United States Federal Bureau of Investigation—to screen thousands of email messages or Web pages for suspicious or incriminating data. Another example is the screening of radio transmissions and television broadcasts for codes.
The formulas used in data mining are known as algorithms. Two common data mining algorithms are regression analysis and classification analysis. Regression analysis is used with numerical data (quantitative data). This analysis constructs a mathematical formula that describes the pattern of the data. The formula can be used to predict future behavior of data, and so is known as the predictive model of data mining.
For example, from a database of terrorists who have corresponded using emails, predictions could be made as to who will send an email and to whom. This would aid efforts to intercept the transmission. This type of data mining is also referred to as text mining.
Data that is not numerical (i.e., colors, names, opinions) is called qualitative data. To analyze this information, classification analysis is best. This model of data mining is also known as the descriptive model.
The data mining process involves several steps:
- Defining the problem.
- Building the database.
- Examining the data.
- Preparing a model to be used to probe the data.
- Testing the model.
- Using the model.
- Putting the results into action.
Database construction and model preparation—in essence the building of the framework for the mining exercise—require about 90 percent of the data mining effort. If these fundamentals are done correctly, use of the model will uncover the data of potential significance.
In July 2002, the Intelligence Technology Innovation Center, which is administered by the United States Central Intelligence Agency (CIA), pledged up to $8 million to the National Science Foundation to bolster ongoing research into data mining techniques. United States intelligence officials suppose that terrorist organizations use Web pages and email to send encoded messages concerning future activities. Currently, unless a message is uncovered by accident, only the monitoring of every Internet transmission from a region can reliably detect the covert information.
Also in 2002, the U.S. Federal Bureau of Investigation and the Central Intelligence Agency, under the direction of the Office for Homeland Security, began the joint development of a supercomputer data mining system. The system will create a database that can be used by federal, state, and local law enforcement agencies. Currently, the FBI and CIA maintain their own separate databases.
Another aspect of data mining is the linking together of data that resides in different databases, such as those maintained by the FBI and the CIA. Often, different databases cannot be searched by the same mechanism, because the language of computer-to-computer communication (protocol) differs from one database to another. This problem also hampers the development of bioinformatics (the computer-assisted examination of large amounts of biological data). Increasingly, biological and computer scientists are advocating that databases be constructed using a similar template, or that they be amenable to analysis using the same search method.
█ FURTHER READING:
Edelstein, Herbert A. Introduction to Data Mining and Knowledge Discovery, Third Edition. Potomac, MD: Two Crows Corporation, 1999.
Han, Jiawei and Micheline Kamber. Data Mining: Concepts and Techniques. New York: Morgan Kaufmann Publishers, 2000.
What You Need To Know About. "Data Mining: An Introduction." About.com. <http://databases.about.com/library/weekly/aa100700a.htm> (17 December 2002).