Data Processing and Data Management
Most data management methods draw a distinction between data, information, and knowledge. Data is a collection of truths and facts, an "is" statement of some sort, without any interpretation. Information is data that has been given context, showing movement and action of some specific entity. When data communicates a clear change, it has become information. Knowledge is the third form: essentially, information in the hands of an experienced analyzer. Knowledge communicates what is likely to happen, how a company can use information, and what the implications of information are for the company.
In the Elearn Training Company's 2007 Making Sense of Data and Information, several ways are given to convert data into information, and information into knowledge. Data is made of facts, often factual numbers such as sales figures, accident records, production output, and resource costs. Upon conversion to information, the data is put into a specific framework, such as charts that identify certain trends, averages, and significant variables. Generally, to convert data into information, a clear set of data affecting the business must be specified, a clear and regular method of collecting the pertinent data established, and the data then converted through some form of analysis, such as computer programs and mathematical formulas.
There are many processes used to collect, define, and organize data into information. Some of the tools analysts use include the following:
- Data cleaning is the process of removing unnecessary or cluttering data that has no relevance to factors the analysis is focusing on.
- Data integration is the process of combining similar sets of data to make analysis easier.
- Data selection is the process, often automated, of retrieving pertinent sets of data from some type of data collection.
- Data transformation involves placing the data into a form in which it can be accessed for specific types of analysis.
- Data mining involves using carefully designed methods to access and gather data.
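The first four steps above can be sketched in a few lines of Python. The sales records and field names here are hypothetical illustrations, not taken from any real system:

```python
# Hypothetical sales records; the None value stands for a missing fact.
raw_records = [
    {"region": "north", "sales": 120, "note": "ok"},
    {"region": "north", "sales": None, "note": "sensor glitch"},
    {"region": "south", "sales": 95, "note": "ok"},
]
extra_records = [{"region": "south", "sales": 110, "note": "ok"}]

# Data cleaning: drop records with missing or irrelevant values.
cleaned = [r for r in raw_records if r["sales"] is not None]

# Data integration: combine similar sets of data to make analysis easier.
combined = cleaned + extra_records

# Data selection: retrieve only the pertinent records.
south = [r for r in combined if r["region"] == "south"]

# Data transformation: place the data in a form suited to analysis,
# here a per-region average of sales.
averages = {}
for r in combined:
    averages.setdefault(r["region"], []).append(r["sales"])
averages = {k: sum(v) / len(v) for k, v in averages.items()}

print(averages)  # {'north': 120.0, 'south': 102.5}
```

Each step narrows or reshapes the same underlying facts; a real pipeline would apply the same sequence at a much larger scale.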
Changing information to knowledge is a more nebulous area, affected largely by the skills of the analysts and the needs of the company. Some business knowledge comes directly from information: production methods based on reliable manufacturing data are one type of knowledge that branches directly from collected information. Reliable company policies use such knowledge as the basis of their operation. Other types of knowledge exist in companies but are more difficult to trace to information. These include intuitive and experiential knowledge, such as the skill of an employee who can predict the outcome of data analysis because of previous work with similar sets of data.
Most of the methods discussed below pertain to changing data, the raw facts, into types of information. Data management includes the methods, concepts, and trends connected to this process.
CHARACTERISTICS OF VALUABLE INFORMATION
According to Ralph M. Stair in his 1999 book, Principles of Information Systems, in order for information to be valuable, it must be:
- Accurate. Accurate information is free from error.
- Complete. Complete information contains all of the important facts.
- Economical. Information should be relatively inexpensive to produce.
- Flexible. Flexible information can be used for a variety of purposes, not just one.
- Reliable. Reliable information is dependable information.
- Relevant. Relevant information is important to the decision-maker.
- Simple. Information should be simple to find and understand.
- Timely. Timely information is readily available when needed.
- Verifiable. Verifiable information can be checked to make sure it is accurate.
DATA GOVERNANCE

Data governance refers to the methods, people, and policies involved in managing data. Those who work in data governance develop the strategies the company uses to sort information and change it into useful knowledge. There are a number of titles given to those involved in data governance, including data trustee, data steward, data registrar, data administrator, director of information flow, and director of application development. Although the tasks for these employees vary, database security and database development are the two primary areas of data governance. For those who wish to develop a data governance system or department for their organization, David Loshin, in his 2001 book, Enterprise Knowledge Management, suggests several steps for bringing together the right individuals:
- Identify those who have a clear interest in the results of developed data. This could include specific investors, members of the corporate board who are in charge of security or information systems, and other company leaders who would be involved in data governance.
- Create a full list of the data sets to be put under the data governance system.
- Define who is currently in charge of the data systems, and determine whether or not those titles and employee positions should persist.
- Create the necessary new roles in data governance, and assign those roles to interested parties.
- Develop a registry that contains all information involved in data governance, as well as the lists and roles that have so far been created.
Once a data governance system has been established, there are many ways to begin managing the company data itself. In their 2007 book, Master Data Management and Customer Data Integration for a Global Enterprise, Berson and Dubov give three activities often undertaken by data governance systems.
First, the process should be defined and implemented. This requires setting up the proper connections between the main storage of data and those who need to use it—the analysts and data governance leaders. How is the data going to get into the system? How many people will be able to use it? To what standard is the data going to be held? How will like sets of data be linked for easy access? How will duplicate sets of data be resolved into one form? These are the sorts of questions the data governance process needs to answer. This is done through a detailed integration of the system and the users, setting up steps to download the data, refine the data, and gather the data into proper families. In addition to setting up these important methods, the process should also include a manual way to investigate and delete unnecessary data, as a complement to automatic correction systems.
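One of the questions above, how duplicate sets of data are resolved into one form, can be sketched as a keyed merge. The customer records below and the newest-record-wins rule are hypothetical assumptions, not a prescribed method:

```python
# Hypothetical duplicate customer records from different offices.
records = [
    {"customer_id": 1, "name": "Acme Corp", "updated": "2007-01-15"},
    {"customer_id": 1, "name": "ACME Corporation", "updated": "2007-06-30"},
    {"customer_id": 2, "name": "Globex", "updated": "2007-03-02"},
]

resolved = {}
for rec in records:
    key = rec["customer_id"]
    # Keep the newer record; ISO dates compare correctly as strings.
    if key not in resolved or rec["updated"] > resolved[key]["updated"]:
        resolved[key] = rec

print(sorted(r["name"] for r in resolved.values()))
# ['ACME Corporation', 'Globex']
```

A production system would use richer matching rules (fuzzy name comparison, address normalization), but the shape of the decision is the same: pick a key, pick a survivorship rule, apply it consistently.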
Second, some form of technology should be chosen and implemented to organize and deliver the data. There are many programs capable of doing this, and the company should choose a system tailored to meet its specific needs. Who will be using the data technology? Will it be a subset of an established intranet system? How will it label the data for those who need to access it? When considering a form of data software, there are several abilities data governance leaders should watch for, including:
- The ability to automatically monitor incoming data from all offices or data hubs
- The ability to track changes in data standards and organization over time, to better understand the accuracy and usability of the data presented
- The ability to incorporate data from other systems besides the company's, and deal with sudden changes in data or data organization
- The ability to provide a consistent and reliable platform to base data decisions on
Third, a company should make sure all its data is available for auditing and accountability purposes. Outside confirmation of the quality and purposes of a company's data is essential. Legislation such as the 2002 Sarbanes-Oxley Act requires that businesses sign off on the accuracy and clarity of their information, so having checks and balances set up within a data governance system can be crucial for compliance. Data stewards often have the task of ensuring data accountability.
THE DATABASE APPROACH

Data can be organized and connected in several ways, one of the most common electronic methods being the database approach. In the database approach, multiple business applications access the same database. Consequently, updates do not have to be made to multiple files; they can be accomplished once in the common database, improving data integrity and eliminating redundancy. The database approach also provides the opportunity to share data as well as information sources. Additional software is required to implement the database approach: a database management system (DBMS). A DBMS consists of a group of programs that serve as an interface between a database and the user, or between the database and the application program. Advantages of the database approach are presented in Table 1; disadvantages are presented in Table 2.
METADATA

Metadata is defined most commonly as "data about data," and is essential in creating data management methods. In its most basic form, metadata consists of the labels and categories placed on data to make analysis easier. For instance, the metadata for a book would contain not the book itself but its author, language, and ISBN. Most people encounter and manipulate metadata when searching for subjects on the Internet. The bits of information pertaining to Web sites that most search engines list are all metadata, which the searcher sifts through to find pertinent data.
In a company's data governance system, metadata is used to classify and control the data available. When analysts choose and manipulate large data groups, they do so through the information collected as metadata. The file type, the name, the timestamp, the physical and electronic location, the owner, and the access permissions are all common types of metadata found in company file systems.
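Collecting such file-system metadata can be sketched with Python's standard library alone; the directory scanned is simply whatever is passed in:

```python
import os
import time

def collect_metadata(directory):
    """Gather common file metadata: name, size, timestamp, permissions."""
    catalog = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        info = os.stat(path)
        catalog.append({
            "name": name,
            "size_bytes": info.st_size,
            "modified": time.ctime(info.st_mtime),   # timestamp
            "mode": oct(info.st_mode & 0o777),       # access permissions
        })
    return catalog

# Scan the current directory and print each file's metadata.
for entry in collect_metadata("."):
    print(entry["name"], entry["size_bytes"], entry["mode"])
```

Note that none of the files' contents are read; the catalog describes the data without containing it, which is what makes metadata cheap to search.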
The structure of the relationships in most databases follows one of three logical database models: hierarchical, network, and relational.
- A hierarchical database model is one in which the data are organized in a top-down, or inverted tree-like, structure. This type of model is best suited for situations where the logical relationships between data can be properly represented with the one-parent-many-children approach.
- A network model is an extension of the hierarchical database model. The network model has an owner-member relationship in which a member may have many owners, in contrast to a one-to-many relationship.
- A relational model describes data using a standard tabular format. All data elements are placed in two-dimensional tables called relations, which are the equivalent of files. Data inquiries and manipulations can be made via columns or rows given specific criteria.

Table 1: Advantages of the Database Approach

Reduced data redundancy: The database approach can reduce or eliminate data redundancy. Data is organized by the DBMS and stored in only one location, resulting in more efficient utilization of system storage space.

Improved data integrity: With the traditional approach, some changes to data were not reflected in all copies of the data kept in separate files. This is prevented with the database approach because there are no separate files that contain copies of the same piece of data.

Easier modification and updating: With the database approach, the DBMS coordinates updates and data modifications. Programmers and users do not have to know where the data is physically stored; data is stored and modified once, at only one location.

Data and program independence: The DBMS organizes the data independently of the application program, so the application program is not affected by the location or type of data. Introduction of new data types not relevant to a particular application does not require rewriting that application to maintain compatibility with the data file.

Better access to data and information: Most DBMSs have software that makes it easier to access and retrieve data from a database; in most cases, simple commands can be given to get important information. Relationships between records can be more easily investigated and exploited, and applications can be more easily combined.

Standardization of data access: A primary feature of the database approach is a standardized, uniform approach to database access. The same overall procedures are used by all application programs to retrieve data and information.

A framework for program development: Because programs go through the DBMS to gain access to data in the database, standardized database access can provide a consistent framework for program development. In addition, each application program need only address the DBMS, not the actual data files, reducing application development time.

Better overall protection of the data: The use of and access to centrally located data is easier to monitor and control. Security codes and passwords can ensure that only authorized people have access to particular data and information in the database, helping ensure privacy.

Shared data and information resources: The cost of hardware, software, and personnel can be spread over a large number of applications and users. This is a primary feature of a DBMS.

Table 2: Disadvantages of the Database Approach

Relatively high cost: Purchasing and operating a DBMS in a mainframe operating environment can be expensive; some mainframe DBMSs cost millions of dollars.

Specialized staff: Additional specialized staff and operating personnel may be needed to implement and coordinate the use of the database. It should be noted, however, that some organizations have been able to implement the database approach with no additional personnel.

Increased vulnerability: Even though databases offer better security because security measures can be concentrated on one system, they may also make more data accessible to a trespasser if security is breached. In addition, if for some reason there is a failure in the DBMS, multiple application programs are affected.
Network database models tend to offer more flexibility than hierarchical models. However, they are more difficult to develop and use because of relationship complexity. The relational database model offers the most flexibility and was very popular during the early 2000s.
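The relational model's tabular storage and row/column inquiries can be sketched with Python's built-in sqlite3 module. The orders table and its columns are illustrative, not drawn from any real schema:

```python
import sqlite3

# A relation is a two-dimensional table; rows are records, columns are
# data elements. An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 120.0), (2, "south", 95.0), (3, "south", 110.0)],
)

# Inquiry by row criteria (WHERE) and column projection (SELECT list).
rows = conn.execute(
    "SELECT order_id, amount FROM orders WHERE region = 'south'"
).fetchall()
print(rows)  # [(2, 95.0), (3, 110.0)]
conn.close()
```

The flexibility noted above comes from exactly this property: any combination of columns and row criteria can be queried without redesigning the storage.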
DATABASE MANAGEMENT SYSTEMS
As indicated previously, a database management system (DBMS) is a group of programs used as an interface between a database and an applications program. DBMSs are classified by the type of database model they support. A relational DBMS would follow the relational model, for example. The functions of a DBMS include data storage and retrieval, database modifications, data manipulation, and report generation.
A data definition language (DDL) is a collection of instructions and commands used to define and describe data and data relationships in a particular database. File descriptions, area descriptions, record descriptions, and set descriptions are terms the DDL defines and uses.
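A short sketch of a DDL at work, again using sqlite3's dialect of SQL (one widely used data definition language); the department/employee schema is hypothetical:

```python
import sqlite3

# DDL statements describe records and the relationships between them.
ddl = """
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(dept_id)  -- a set description
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# The defined structure can be read back from the system catalog.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['department', 'employee']
conn.close()
```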
A data dictionary also is important to database management. This is a detailed description of the structure and intended content in the database. For example, a data dictionary might specify the maximum number of characters allowed in each type of field and whether the field content can include numbers, letters, or specially formatted content such as dates or currencies. Data dictionaries are used to provide a standard definition of terms and data elements, assist programmers in designing and writing programs, simplify database modifications, reduce data redundancy, increase data reliability, and decrease program development time.
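A data dictionary can be sketched as a structure that validation code consults before data enters the database. The field names and rules below are hypothetical illustrations of the character-count and content-type constraints described above:

```python
# Hypothetical data dictionary: intended type and maximum length per field.
data_dictionary = {
    "customer_name": {"type": str, "max_length": 40},
    "order_total":   {"type": float, "max_length": None},
}

def validate(record):
    """Return a list of rule violations; an empty list means valid."""
    errors = []
    for field, rules in data_dictionary.items():
        value = record.get(field)
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
        elif rules["max_length"] and len(value) > rules["max_length"]:
            errors.append(f"{field}: too long")
    return errors

print(validate({"customer_name": "Acme Corp", "order_total": 99.5}))   # []
print(validate({"customer_name": "Acme Corp", "order_total": "99.5"}))
# ['order_total: expected float']
```

Because every program checks against the same dictionary, the standard definitions reduce redundancy and keep field contents consistent across applications.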
The choice of a particular DBMS typically is a function of several considerations. Economic cost considerations include software acquisition costs, maintenance costs, hardware acquisition costs, database creation and conversion costs, personnel costs, training costs, and operating costs.
Most DBMS vendors are combining their products with text editors and browsers, report generators, listing utilities, communication software, data entry and display features, and graphical design tools. Consequently, those looking for a total design system have many choices.
DATA MINING

The process of collecting applicable data is referred to as data mining. Most data governance systems have automatic data-mining programs designed to fit the analysts' needs. These programs sort through and summarize data according to certain parameters. If a company wanted to cut costs in manufacturing, for instance, a data-mining activity would search for figures and facts concerning manufacturing, collect that data into categories such as supply costs and worker costs, and finally transmit the categorized data to the proper people.
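The cost-cutting example can be sketched as a filter-and-categorize pass; the records and category names here are hypothetical:

```python
# Hypothetical raw facts drawn from across the company.
facts = [
    {"area": "manufacturing", "kind": "supply", "cost": 500},
    {"area": "manufacturing", "kind": "worker", "cost": 1200},
    {"area": "marketing",     "kind": "ads",    "cost": 300},
    {"area": "manufacturing", "kind": "supply", "cost": 450},
]

# Mining parameter: only manufacturing data is pertinent to this analysis.
pertinent = [f for f in facts if f["area"] == "manufacturing"]

# Collect into categories such as supply costs and worker costs.
totals = {}
for f in pertinent:
    totals[f["kind"]] = totals.get(f["kind"], 0) + f["cost"]

print(totals)  # {'supply': 950, 'worker': 1200}
```

The categorized totals, not the raw facts, are what would be transmitted to the proper people.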
To identify patterns, data mining is controlled by strict guidelines. For instance, one analysis might mine for associations, events that are connected to each other through some type of relationship. Sequential patterns are another type of data-mining parameter, where data is found in events that naturally lead to one another, such as those in a supply chain. Some data-mining activities focus on classification and the search for new patterns in the available data. Still other data mining might be done to predict trends or outcomes of particular events.
Although data mining is an immensely popular tool, it does have blind spots. For the data mining to be useful, a skilled analyst must set clear parameters and interpret the data correctly. Data mining does not make value judgments or attribute importance.
Data clustering is often confused with data mining, but the two are separate actions. Data clustering is a subset of data mining, and is usually performed first in a data-mining activity. It is an automatic function, based on mathematical principles, that groups data into similar categories. When a data-mining operation is run, the data is clustered into applicable areas such as "Costs" and "Profits." The data-mining operation then goes on to apply further required parameters to the data.
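A minimal sketch of such clustering, grouping amounts into "Costs" and "Profits" before any further mining. The sign-based rule is an illustrative simplification of the mathematical grouping a real system would use:

```python
# Hypothetical ledger amounts: negative values are costs, positive are profits.
amounts = [-500, 1200, -450, 300, -75]

clusters = {"Costs": [], "Profits": []}
for a in amounts:
    clusters["Costs" if a < 0 else "Profits"].append(a)

print(clusters)
# {'Costs': [-500, -450, -75], 'Profits': [1200, 300]}
```

Once the data is clustered, the mining step can apply its parameters to each group separately.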
DATA WAREHOUSING

Data warehousing involves taking data from a main computer for analysis without slowing down the main computer. In this manner, data are stored in another database for analyzing trends and new relationships. Consequently, the data warehouse is not the live, active system; instead, it is updated daily or weekly. For example, Wal-Mart uses a very large database (VLDB) that is 4 trillion bytes (4 terabytes) in size. Smaller parts of this database can be warehoused for further analysis to avoid slowing down the VLDB.
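The snapshot idea can be sketched with sqlite3, whose backup() call stands in here for a nightly warehouse extract; the sales table is illustrative:

```python
import sqlite3

# The "live" system that must not be slowed by analysis queries.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
live.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
live.commit()

# Copy a snapshot into a separate warehouse database, e.g. daily or weekly.
warehouse = sqlite3.connect(":memory:")
live.backup(warehouse)

# Analysts query the warehouse copy, not the live system.
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 29.5
live.close()
warehouse.close()
```

The warehouse is deliberately stale between refreshes; that staleness is the price paid for keeping trend analysis off the live system.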
A private database is compiled from individual consumer or business customer names and addresses maintained by a company for use in its own marketing efforts. Such a database may have originated as a public database, but typically once the company begins adding or removing information it is considered a private database. By contrast, public databases are those names, addresses, and data that are compiled for resale in the list rental market. This is publicly available data (i.e., any business can purchase it on the open market) rather than lists of specific customers or targets.
However, a new trend is combining features of the two approaches. Cooperative databases are compiled by combining privately held response files of participating companies so that costs are shared. Many consider this to be a future trend, such that virtually all catalog marketers, for example, would use cooperative databases.
Geographic Information Systems (GIS) are a growing area of data management. GIS involves combining demographic, environmental, or other business data with geographic data. This can involve road networks and urban mapping, as well as consumer buying habits and how they relate to the local geography. Output is often presented in a visual data map that facilitates the discovery of new patterns and knowledge.
Customer Relationship Management (CRM) is another area in which data processing and data management are deeply involved. CRM is a set of methodologies and software applications for managing the customer relationship. CRM provides the opportunity for management, salespeople, marketers, and potentially even customers to see sufficient detail regarding customer activities and contacts. This allows companies to offer other possible products or useful services, as well as other business options. Security of this information is of significant concern on both sides of the relationship.
SEE ALSO Computer Networks; Computer Security
BIBLIOGRAPHY

Berson, Alex, and Larry Dubov. Master Data Management and Customer Data Integration for a Global Enterprise. New York: McGraw-Hill, 2007.
Churchill, Gilbert A. Marketing Research: Methodological Foundations. 8th ed. Cincinnati, OH: South-Western College Publishing, 2001.
“Data Governance Channel.” DMReview.com, 2008. Available from http://www.dmreview.com/channels/data_governance.html.
“Data Mining.” SearchSQL, 2006. Available from: http://searchsqlserver.techtarget.com/sDefinition/0,sid87_gci211901,00.html.
Elearn Training Company. Making Sense of Data and Information. Oxford: Elsevier, 2007.
“Metadata Definition.” The Linux Information Project, 2006.
Palace, Bill. “Data Mining: What is Data Mining?” Anderson Graduate School of Management of UCLA, 1996. Available from: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/index.htm.
Seifert, Jeffrey. Data Mining: An Overview. Congressional Research Service, 2004.
Stair, Ralph M. Principles of Information Systems: A Managerial Approach. 4th ed. Cambridge, MA: Course Technology, 1999.
Wang, John. Data Mining: Opportunities and Challenges. Hershey, PA: Idea Group Publishing, 2003.
White, Ken. “DBMS Past, Present, and Future.” Dr. Dobb's Journal 26, no. 8 (2001): 21–26.