Scientific Data Management in Earth Sciences
Data constitute the raw material of scientific understanding. They are divided into analytical data (i.e., numbers with units) and meta-information (i.e., context describing the analytical data). Data management is the control of data handling operations such as acquisition, analysis, quality checking, processing, storage, retrieval, distribution, and sharing of data. It does not necessarily include the generation and use of data. Data management ensures the integrity of research, confidentiality, and compliance with sponsors' requirements, and it protects intellectual property. Scientific data management in Earth sciences covers at least four major fields of scientific activity, namely the geosphere, hydrosphere, atmosphere, and biosphere. Data cover time scales ranging from seconds to millions of millennia and provide baseline information for research in many disciplines, among them the monitoring of environmental changes, whether gradual or sudden, foreseen or unexpected, natural or man-made.
Scientific data gathering has a long history and has evolved from descriptive cataloguing to a relational digital record of expertise. The Chinese chronicled information about solar and auroral activity in past millennia. In the eighteenth and nineteenth centuries, geological data were recorded in expedition reports. Since the middle of the twentieth century, large-scale marine projects have produced huge and heterogeneous volumes of numeric data. At that time, a data management strategy termed the 'box of floppies' approach was developed. Data sets were supplied to the data center as discrete entities (usually on floppy disks), where they were checked, catalogued, and stored. On demand, clients were supplied with the data sets necessary to satisfy their requirements. The data management philosophy in these scenarios was firmly focused on data archival. Today, the challenge of scientific data management is to provide the scientific community with standardized import and export routines, convenient and uniform retrieval functions, and efficient tools for the graphical visualization of analytical data and meta-information.
The unique requirement of data management in Earth sciences, compared to other (natural) sciences, is that every datum has to be specified by a space-time geo-code, i.e., by geographical dimensions (latitude, longitude, sample depth) and time dimensions (date/time, period of time, age [model]). Together with the key parameters sample compartment (e.g., water column, sediment), variable and unit, and principal investigator (i.e., the owner of a data series), any data collection can be mapped, however heterogeneous it may be. To describe this so-called n-dimensional parameter catalogue, a meta-information catalogue was devised that comprises project facts, campaign information, station data, scientific method, public access status, and the reference in which the data were first published. Together, the parameter and meta-information catalogues itemize each analytical value and serve as its unambiguous identifier.
Validation and verification of data are the two most critical components of scientific data management. Even if scientific data are supposed to be correct, the definition of what is correct is far from straightforward. It is quite often a matter of opinion, and opinions change as scientific knowledge changes. For example, the CLIMAP Project Members referred to the Last Glacial Maximum as "18k" (i.e., 18,000 years ago), whereas Bard revised this concept some 25 years later to "21,000 calendar years ago." Each datum reflects scientific opinion at the time it was recorded and may be revised as knowledge advances. It is not essential that every data set be of excellent quality, but it is important that exact information on its quality be provided. The user of a specific data set must be able to verify the data by reading the reference publication and thus decide whether the retrieved data are useful.
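The geo-code, key parameters, and meta-information catalogue described above can be pictured as one record per analytical value. The following Python sketch is only an illustration under stated assumptions: the class names (GeoCode, MetaInfo, Datum), the field names, the units, and the sample values are all invented here and do not describe any particular data center's schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class GeoCode:
    """Space-time geo-code: where and when a datum was taken (hypothetical layout)."""
    latitude: float                        # decimal degrees, north positive
    longitude: float                       # decimal degrees, east positive
    depth_m: float                         # sample depth in metres
    timestamp: Optional[datetime] = None   # date/time of sampling, if known
    age_ka: Optional[float] = None         # age-model age in thousand years, if any

    def __post_init__(self):
        # Reject coordinates that cannot be a position on Earth.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")


@dataclass(frozen=True)
class MetaInfo:
    """Meta-information catalogue entry accompanying a datum."""
    project: str
    campaign: str
    station: str
    method: str
    public: bool       # public access status
    reference: str     # where the data were first published


@dataclass(frozen=True)
class Datum:
    """One analytical value with its key parameters, geo-code, and meta-information."""
    value: float
    variable: str
    unit: str
    compartment: str               # e.g. "water column", "sediment"
    principal_investigator: str    # owner of the data series
    geocode: GeoCode
    meta: MetaInfo

    def key(self) -> tuple:
        """Parameter-catalogue key: geo-code plus the key parameters."""
        g = self.geocode
        return (g.latitude, g.longitude, g.depth_m, g.timestamp, g.age_ka,
                self.compartment, self.variable, self.unit,
                self.principal_investigator)


# Invented example values, for illustration only.
example = Datum(
    value=3.4, variable="temperature", unit="deg C",
    compartment="water column", principal_investigator="J. Doe",
    geocode=GeoCode(latitude=53.1, longitude=8.8, depth_m=25.0,
                    timestamp=datetime(2001, 5, 17, 12, 0)),
    meta=MetaInfo(project="EXAMPLE", campaign="EX-01", station="ST-001",
                  method="CTD probe", public=False, reference="unpublished"),
)
```

Keeping the analytical value separate from the tuple returned by key() mirrors the idea that the catalogues identify a value, while the value itself may later be revised.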
Because unpublished data are even more sensitive than published data, the data management group is obliged to ensure that data are not accessed from outside a project until they are formally placed in the public domain.
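The access rule stated above can be written as a small check. This is a minimal sketch that assumes each data set carries a project name and a publication flag; the names DataSetStatus and may_access are hypothetical, and real systems would also need authentication and auditing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetStatus:
    """Hypothetical access status of one archived data set."""
    project: str      # project that owns the data set
    published: bool   # True once formally placed in the public domain


def may_access(status: DataSetStatus, requester_projects: set) -> bool:
    """Return True if a requester may retrieve the data set.

    Published data are open to everyone; unpublished data are open
    only to requesters who belong to the owning project.
    """
    if status.published:
        return True
    return status.project in requester_projects
```

For example, may_access(DataSetStatus("EXAMPLE", False), {"OTHER"}) evaluates to False, while the same request from a member of "EXAMPLE" is granted.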
Consequently, data management in Earth sciences calls for an information system that represents the n-dimensional parameter catalogue and the accompanying meta-information catalogue in a suitable data model and archives the data collection in such a way that every datum is thoroughly described at every stage and can be traced back to its origin in order to protect copyright. At the same time, every interface in the data flow is administered independently: from the scientific community to data management, from data management to the database, from the database to the data application (e.g., a numerical model), and from the application through data management back to the scientific community. Finally, the archived data may be retrieved and presented as raw data or as graphics, and the data format may differ each time. A popular conceptual construct applied to such an approach is the multidimensional view of data. This data model formally serves as a basis for inductively generating hypotheses by running a search algorithm on specific data sets, which is commonly called data mining.
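A multidimensional view treats each catalogue attribute as a dimension along which the data can be fixed or summarized. The sketch below illustrates only these two basic operations on plain dictionaries and is not a data mining algorithm; the function names slice_view and roll_up, the dimension names, and all values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def slice_view(records, **fixed):
    """Keep records whose dimensions match every fixed value (a 'slice')."""
    return [r for r in records
            if all(r.get(dim) == val for dim, val in fixed.items())]


def roll_up(records, dimension, measure="value"):
    """Average the measure over all records sharing a value of one dimension."""
    groups = defaultdict(list)
    for r in records:
        groups[r[dimension]].append(r[measure])
    return {key: mean(vals) for key, vals in groups.items()}


# Invented records: each dictionary key other than "value" is a dimension.
records = [
    {"variable": "temperature", "compartment": "water column", "depth_m": 10, "value": 4.0},
    {"variable": "temperature", "compartment": "water column", "depth_m": 10, "value": 6.0},
    {"variable": "temperature", "compartment": "water column", "depth_m": 50, "value": 3.0},
    {"variable": "salinity", "compartment": "water column", "depth_m": 10, "value": 35.0},
]

temperatures = slice_view(records, variable="temperature")
mean_by_depth = roll_up(temperatures, "depth_m")   # {10: 5.0, 50: 3.0}
```

A search for patterns, as in data mining, would operate on many such slices and summaries rather than on the raw records one by one.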
In practice, the multidimensional data model and data mining tools are implemented in Earth sciences by the World Data Center (WDC) system of the International Council of Scientific Unions. It works to guarantee long-term access to data in all fields of Earth sciences. The categories of World Data Centers read like a Who's Who in Earth sciences: airglow (Tokyo, Japan), astronomy (Beijing, China), atmospheric trace gases (Oak Ridge, United States), aurora (Tokyo, Japan), cosmic rays (Toyokawa, Japan), Earth tides (Brussels, Belgium), geology (Beijing, China), geomagnetism (Copenhagen, Denmark; Edinburgh, United Kingdom; Kyoto, Japan; Mumbai, India), glaciology (Boulder, United States; Cambridge, United Kingdom; Lanzhou, China), human interactions in the environment (Palisades, United States), ionosphere (Tokyo, Japan), marine environmental sciences (Bremen, Germany), marine geology and geophysics (Boulder, United States; Moscow, Russia), meteorology (Asheville, United States; Beijing, China; Obninsk, Russia), nuclear radiation (Tokyo, Japan), oceanography (Obninsk, Russia; Silver Spring, United States; Tianjin, China), paleoclimatology (Boulder, United States), recent crustal movements (Ondrejov, Czech Republic), remotely sensed land data (Sioux Falls, United States), renewable resources and environment (Beijing, China), rockets and satellites (Obninsk, Russia), rotation of the Earth (Obninsk, Russia; Washington, United States), satellite information (Greenbelt, United States), seismology (Denver, United States; Beijing, China), soils (Wageningen, The Netherlands), solar activity (Meudon, France), solar radio emissions (Nagano, Japan), solar terrestrial physics (Boulder, United States; Didcot, Oxon, United Kingdom; Moscow, Russia; Haymarket, Australia), solid Earth geophysics (Beijing, China; Boulder, United States; Moscow, Russia), space science (Beijing, China; Sagamihara, Japan), sunspot index (Brussels, Belgium).
Since the beginnings of modern scientific data management in Earth sciences, the gathering and exchange of data have been transformed by rapid technological advances, such as the replacement of analog with digital instruments, the networking of digital instruments to simplify the collection and exchange of data, and unmanned automatic observatories. Personal computers and compact disc readers are ubiquitous. Many World Data Centers publish collections of digital data sets on compact discs for easy distribution. Digital communication networks make it possible to transfer large data files by electronic mail. Environmental disciplines make use of map-based data through Geographical Information Systems. The collaboration of international scientific bodies ensures the continuation of long-term monitoring of the Earth system, the permanent preservation of the acquired data for the mutual benefit of the international scientific community, and their dissemination through publications, workshops, exhibitions, and other means.
See also GIS; Ice ages; International Council of Scientific Unions World Data Center System