Data warehousing refers to the organization and assembly of data created from day-to-day business operations. Data warehousing enables a user to retrieve data from online transaction processing (OLTP) and online analytical processing (OLAP), and allows for the storage of that data in a format that can be read and analyzed. The integrated information, which is stored in a data warehouse, can be analyzed and queried to help management make more informed business decisions.
The idea of data warehousing dates back to the early 1980s. At that time, a popular system that utilized the concept of data warehousing was the relational database, which was run on minicomputers and used for OLTP functions. Quite often, relational database systems operated networks such as automated teller machines. As technology continued to advance, several key factors—including changing business trends, the evolution of the global economy, enterprise resource planning (ERP), business process reengineering (BPR), increased focus on customer needs, and the rise of e-business—led to the development of data warehouses in the 1990s.
Run on powerful client/server networks, not only can data warehouses read OLTP, they are equipped to translate OLAP as well. The development of data warehousing enabled companies to gather several types of information concerning business transactions, as well as important analytical data. In Contract Professional Magazine, Pam Derringer wrote that as a knowledge tool, "data warehousing restructures massive volumes of unorganized data into new formats that can be queried for answers to individual questions or sliced and diced for analytical trend reports."
Two important types of information in data warehousing are operational and informational data. Operational data—the data businesses use on a day-to-day basis—is stored, retrieved, and updated by an OLTP system. This type of data normally is stored in a relational database. Informational data is operational data that has been manipulated and summarized, and is what makes up a data warehouse. In the process of data warehousing, informational data is created from operational data and systems by using transformation or propagation tools. This process is necessary to ensure that the information can be retrieved in an easy and time-efficient manner. Multidimensional analysis, or OLAP, is the desired result of data warehousing. It allows a user to analyze large amounts of data regarding things like sales, products, time periods, and geographies. The multidimensional data structure, or data warehouse, allows for the storing and analyzing of such data.
Another component to data warehousing is meta-data, which is made up of technical data and business data. Technical data is used by system administrators and contains information about the data warehouse itself. Business data, on the other hand, is what an analyst might be searching for in order to forecast sales or predict trends. Data mining tools are then used to interpret data and find patterns within the information. For example, a retail company might use data warehousing and data mining to find relationships in purchasing patterns and to gather information about its customers.
Implementing a data warehouse structure within a company can be a costly and time-intensive process. These barriers have led to the development of data marts—smaller versions of data warehouses that are more specialized to serve a specific department and/or cover a specific topic. Traditional data warehouses are measured in gigabytes and terabytes, whereas the more compact data marts are measured in megabytes. Smaller companies, with more limited budgets, often opt for this type of data structure.
With the rise of e-commerce, data warehousing is becoming a key business component in the operation of both brick-and-mortar companies as well as dot-com ventures. The evolution of customer relationship management (CRM), the increasing popularity of the Internet, and the formation of online marketplaces and business-to-consumer companies such as Amazon.com and other e-tailers, have increased the demand for data warehousing solutions. While implementing data warehousing can be very costly, a study conducted by International Data Corp. concluded that firms utilizing data warehouse systems saw an average return on investment of nearly 400 percent over three years. Each year, as the billions of dollars spent online for products and services increases, businesses are turning to advanced data management solutions to analyze information, make forecasts, look for trends, identify shopper characteristics, and control inventory. This has increased the demand for data warehousing and, therefore, increased competition between solution-based companies. According to DM Review, the top ten business intelligence vendors—those offering e-business, CRM, and data warehousing solutions—at the advent of the twenty-first century were SAS, NCR Corp., Oracle Corp., Computer Associates, Cognos Corp., MicroStrategy Inc., Microsoft Corp., IBM, Informix Business Solutions, and Hyperion.
"2000 DM Review 100 Numerical Ranking." DM Review. November 2001. Available from www.dmreview.com/awards/top100/2000.
"Data Warehousing Concepts for AS/400." Armonk, NY: IBM Corp., 2000. Available from www-1.ibm.com.htm.
Derringer, Pam. "Data Warehousing: The Next Boom?" Contract Professional Magazine. 2000. Available from www.freeagent.com.
Moye, Joe, and Dave Upton. "Data Warehousing 101." Strategic Finance. February, 2001.
Schroeck, Michael. "Data Warehousing: The Past 10 Years Have Been Quite a Ride." DM Review. February, 2001. Available from www.dmreview.com.
SEE ALSO: Data Mining; Database Management; Information Management Systems
With the advent of the information age, the amount of digital information that is recorded and stored has been increasing at a tremendous rate. Common data formats for storage include commercial relational database engines, often interconnected via an intranet , and more recently World Wide Web sites connected via the Internet. The interconnectivity of these data sources offers the opportunity to access a vast amount of information spread over numerous data sources. Modern applications that could benefit from this wealth of digital information abound, and they range over diverse domains such as business intelligence (e.g., trade-market analysis or online web access monitoring), leisure (e.g., travel and weather), science (e.g., integration of diagnoses from nurses, doctors, and specialists about patients), libraries (e.g., multimedia online resources like museums and art collections), and education (e.g., lecture notes, syllabi, exams, and transparencies from different web sites). The one common element among all these applications is the fact that they must make use of data of multiple types and origins in order to function most effectively. This need emphasizes the demand for suitable integration tools that allow such applications to make effective use of diverse data sets by supporting the browsing and querying of tailored information subsets.
In contrast to the on-demand approach to information integration, where applications requests are processed on-the-fly, the approach of tailored information repository construction, commonly referred to as data warehousing, represents a viable solution alternative. In data warehousing, there is an initial setup phase during which relevant information is extracted from different networked data sources, transformed and cleansed as necessary, fused with information from other sources, and then loaded into a centralized data store, called the data warehouse. Thereafter, queries posed against the environment can be directly evaluated against the pre-computed data warehouse store without requiring any further interaction and resultant processing delay.
Data warehousing offers higher availability and better query performance than the on-demand approach because all data can be retrieved directly from one single dedicated site. Thus, it is a suitable choice when high-performance query processing and data analysis are critical. This approach is also desirable when the data sources are expensive to access or even sometimes become unavailable, when the network exhibits high delays or is unreliable, or when integration tasks such as query translation or information fusion are too complex and ineffective to be executed on-the-fly.
However, such a static snapshot of the data kept in a data warehouse is not sufficient for many real-time applications, such as investment advising. Hence updates made to the data in individual sources must be reflected in the data warehouse store. This can be accomplished by a complete reload of the data warehouse store on some periodic schedule, say once a day during the off-peak business time. Given the size of many modern data warehouses, such a reload is often too time consuming and hence not practically feasible. This has led to the development of strategies for incremental database maintenance, a process whereby a data warehouse is updated more efficiently with information that is fed into an existing database.
Many types of systems benefit from such a data warehousing paradigm . The first category includes monolithic systems, where one organization controls both the single data source providing the data feed as well as the back-end data warehouse store. An online purchasing store such as Amazon.com has, for example, the web-based front end that handles high-performance transactions by customers, whereas the underlying data warehouse serves as a container of all transactions logged over time for offline analysis. The second category includes distributed yet closed environments composed of a small number of independent data sources controlled by trusted owners with a joint cooperative goal. An example would be a hospital information system that attempts to integrate the data sources maintained by different units such as the personnel department, the pharmacy, and the registration system. Large-scale open environments such as the World Wide Web represent the third category where unrelated sources come and go at unpredictable times and the construction of temporary data warehouses for new purposes are common.
These data warehousing systems often feature a multi-tier architecture. The individual data sources in a networked environment are at the bottom tier. These sources often are heterogeneous, meaning that they are modeled by diverse data models and each support different query interfaces and search engines. This may include legacy systems, proprietary application programmer interfaces, traditional relational database servers, or even new technology such as web sites, SGML or XML web documents, news wires, and multimedia sites. Due to the heterogeneity of the data sources, there is typically some wrapper software associated with each data source that allows for smoother communication between the queries and processes associated with both the new data and the data warehousing system.
The software tools in the middle tier, collectively referred to as the data warehouse management system, are dedicated to diverse integration services. These software tools offer services beyond those common to a traditional database engine. For example, there may be tools for filtering and cleansing information extracted from individual data sources, for intelligently fusing information from multiple sources into one integrated chunk of knowledge, or for incrementally keeping the data warehouse up-to-date under source changes.
Finally, the actual data warehouse store is (at least logically) a centralized database repository that must support complex analysis queries at high levels of performance. In current systems, such a data warehouse store is built using standard relational database servers due to the maturity of this technology. Such complex decision and analysis query support on databases is commonly referred to as online analytic processing. Depending on the requirements of the application, additional data analysis services may be built on top of the integrated data warehouse store. This may include graphical display systems, statistics and modeling packages, and even sophisticated data mining tools that enable some form of discovery of interesting trends or patterns in the data.
see also Data Mining; Database Management Software; E-commerce.
Elke A. Rundensteiner
Bulletin of the Technical Committee on Data Engineering, Special Issue: Materialized Views and Data Warehousing, 18, no. 2 (1995): 2.
Chaudhuri, Surajit, and Umeshwar Dayal. An Overview of Data Warehousing and OLAP Technology. (ACM Special Interest Group on Management of Data) ACM SIGMOD Record 26 (1997): 65-74.
Rundensteiner, Elke A., Andreas Koeller, and Xin Zhang. "Maintaining Data Warehouses over Changing Information Sources." Communications of the ACM 43, no. 6-(2000): 57-62.