April 17 - 19, 2007   
Hyatt Regency Phoenix   
Phoenix, Arizona   

 

Position papers

eDatabase Lessons for an eData World

 
   

NSF/JISC Repositories Workshop
Rick Luce, Emory University
April 9, 2007

After witnessing at close hand the last two decades of transformation from paper to digital, wherein evolution to digital publishing and digital libraries largely has followed the linear thinking of replicating processes rather than transforming processes, and where the early phases of the paradigm shift could be characterized by essentially a duplication of the current print medium, what observations and lessons might be applied to the many problems confronting data science and massive repositories?

Steering Clear of Alexandria

The free-enterprise model requires competition to allow the best solutions to emerge by stimulating the experimentation and urgency required to bring new ideas rapidly into play. Experience has shown that centralized solutions may not be the best option for ensuring innovation over time. The notion of a centralized data archive has some attractive advantages, notably the convenience of a one-stop shop, but it also carries significant disadvantages. Centralized approaches are traditionally more vulnerable to failure and tend to pick winners prematurely.

The shadow of politics looms over any initiative dominated by a single organization (e.g., ACS) or country, and centralized repository approaches are typically discipline-centric, which engenders problems in categorizing new, trans-disciplinary science. It is an opportune time to experiment and rethink the assumptions that underlie our systems, but let us start by examining some issues in our current system that remain unsolved before that system is scaled up to support eScience.

Systems Science in a Data Driven World

Today many of the exciting and innovative developments in science can be found at the intersection of trans-disciplinary domains, leading to the need for new conceptualizations of the infrastructure supporting systems science and the emergence of data science. A foundational prerequisite for systems science is the integration of heterogeneous experimental data, which today are stored in numerous domain-specific databases. However, a wide range of obstacles that relate to access, handling, and integration impede the efficient use of the contents of these databases. An examination of a few of the limitations surrounding the use of today’s scientific databases might be insightful in thinking about one dimension of the data repository challenges ahead.

Massive amounts of data produced on a daily basis require more sophisticated management solutions than today’s database environments provide, and the availability of the Internet as an enabling infrastructure for scientific exchange has created new demands for data accessibility. Furthermore, new fields such as earth systems science, computational pathomics, climate change, biogeochemistry, paleo-climate, and systems biology have further increased the requirements that are demanded of databases and data repositories. Although we require systems that support ubiquitous knowledge and information environments, many issues arise requiring better solutions. The limitations that characterize our current database environment will be increasingly magnified in an era of eScience; some of those limitations include:

Finding Relevant Sources – Even in the Google era it is difficult to identify suitable data sources and well described repositories via the web.  A first step in building models that support transdisciplinary science requires the researcher to locate relevant data repositories and databases outside of one’s known field. One critical component of the emerging cyber-infrastructure is the array of instruments and sensors deployed on the grid. We need to create a global registry of instruments and sensors so that scientists and scientists-in-training can obtain information about them, including how to use them. A description, at a minimum, of the relevant dataset or database contents and the way in which the data are produced and/or derived from other data sources is mandatory. Unfortunately, well-described global registries are not the norm and not every database provides such meta-information.
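As a minimal sketch of the kind of meta-information such a registry entry might carry, the following Python fragment records a description, how the data are produced, and which other sources they are derived from. The field names, identifiers, and sample entries are hypothetical illustrations, not an existing registry schema.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """One hypothetical registry record for an instrument, sensor, or dataset."""
    identifier: str
    description: str           # what the source contains
    produced_by: str           # how the data are produced
    derived_from: list = field(default_factory=list)  # identifiers of upstream sources
    keywords: list = field(default_factory=list)


def find_entries(registry, term):
    """Return entries whose description contains the term or whose keywords match it."""
    term = term.lower()
    return [
        entry for entry in registry
        if term in entry.description.lower()
        or any(term == keyword.lower() for keyword in entry.keywords)
    ]


# Illustrative entries only; none of these refer to real sources.
REGISTRY = [
    RegistryEntry("sensor-001", "Ocean buoy temperature sensor",
                  "direct measurement", keywords=["ocean", "temperature"]),
    RegistryEntry("dataset-002", "Gridded sea-surface temperature product",
                  "interpolation of buoy readings",
                  derived_from=["sensor-001"], keywords=["ocean", "climate"]),
    RegistryEntry("dataset-003", "Protein interaction database",
                  "manual curation of literature", keywords=["biology"]),
]

matches = [entry.identifier for entry in find_entries(REGISTRY, "ocean")]
```

Even a record this small lets a researcher from another field discover a source by keyword and see whether it is a primary measurement or a derived product.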

Data Processing – Imagine trying to support collaborative eScience projects without large-scale, automated data processing. In an era where we’d like the data to speak to the data, a large number of scientific databases still aren’t equipped with programming interfaces that enable software developers to query them from within their own programs and systems. Although current production systems can support standardized interfaces, public access to these interfaces is rarely provided. The rationale for denial ranges from security concerns to financial considerations. Web-based access is unsuitable for bulk queries, and programming interfaces are only rarely available. When data downloading is not an option, contents must be extracted from the web interface. This sub-optimal approach requires customized data-extraction software for each data source and has many technical limitations.

Where downloading is supported, flat files are often still the de facto standard for data exchange. Because no standardized format for flat files exists, the thousands of data collections come in many different formats. Self-describing XML files that could be readily harvested would solve many of these problems, since generic XML parsers are widely available, but only a very small number of databases are currently provided in XML. The importance of XML has been increasingly recognized, and standardized XML-based data-exchange formats should be strongly encouraged.
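To illustrate why a self-describing format eases harvesting, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser. The sample document and its element and attribute names (`dataset`, `record`, `field`, `name`, `type`, `unit`) are assumptions made for this example, not an established exchange format.

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing export: each field carries its own name and type.
SAMPLE = """<dataset name="example-measurements">
  <record>
    <field name="site" type="string">A1</field>
    <field name="temperature" type="float" unit="K">287.4</field>
  </record>
  <record>
    <field name="site" type="string">B2</field>
    <field name="temperature" type="float" unit="K">290.1</field>
  </record>
</dataset>"""

# Map the declared type of a field to a Python conversion function.
CASTS = {"string": str, "float": float, "int": int}


def parse_records(xml_text):
    """Turn every <record> into a dict, using only the metadata in the file itself."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall("record"):
        row = {}
        for item in record.findall("field"):
            cast = CASTS.get(item.get("type", "string"), str)
            row[item.get("name")] = cast(item.text)
        records.append(row)
    return records


records = parse_records(SAMPLE)
```

The same function would work on any collection that followed this assumed layout, whereas each undocumented flat-file format needs its own hand-written reader.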

Content, Missing Content - Many useful types of information are missing from widely used databases; however, little incentive currently exists to (re)supply the missing data. As a standard practice, funding agencies should require the submission of fully described results to public databases, which is not the rule in all domains. To minimize the risk of human error during data submission, appropriate curation protocols and supporting software must be implemented. Since errors in data repositories and databases are a known problem, data providers should implement appropriate means to report, track, and correct errors in a timely manner.

Can’t We Talk – It seems almost too obvious to state that we need close bi-directional communication between database providers and users to address problems. While the web 2.0 world has begun to adopt social software and connectedness as a means of collaborating, in the database world many providers still desire to control their silo and consequently are not open about their data-curation processes or about changes to schemas and contents. Simple as it may seem, error reporting and tracking is not the rule.

Missing in Education - Many problems in the use of scientific databases can be traced to a lack of interest in, and basic understanding of, data management on the part of the scientific expert, while informaticians may not be aware of the domain needs. Because the communication difficulties that arise from these problems clearly have educational roots, the curricula for both informaticians and research scientists should be better defined to equip future practitioners.

Access – Financial and political issues drive the most controversial dimension, that of ubiquitous access to data and databases. The most important problem here is the question of free vs. fee access to scientific data and databases. It seems obvious that free access for all to scientific data and databases would be beneficial, but the reality is more complex. Data curation with highly qualified staff is costly, and as a result sustainability and financial issues arise. Most funding agencies do not provide long-term support for data curation, and alternative funding models are required. Depending on the funding model selected, different trade-offs result. Some important databases are not publicly available (Chemical Abstracts), while others are freely accessible through a web interface, although downloading is not permitted. Some providers block requests from entire domains when they suspect someone is attempting to 'steal' their data using automated data parsing from a web interface. Licensing conditions of 'free' licenses may impose considerable obstacles, e.g., when database providers demand that the origin of the data be transparent to the user. Another licensing problem is the redistribution of data, which may not be permitted. The newest wrinkle is the demand for co-authorship in any publication that makes use of the database in any way. Clearly a universal legal framework for database interoperability is overdue.

Curation Requires Funding - Databases are fundamental to entire disciplines such as chemistry and biology; however, long-term curation efforts are rarely supported, and most providers of publicly available databases have funding problems. Funding for long-term curation of data repositories and scientific databases is required, and one can only wonder at the eventual state of massively scaled data repositories a decade hence if this is ignored.

An Evolutionary Direction - the Adaptive Web

Since we are in the early stages of developing the new paradigm(s) required to support data science and constructing solutions for massively scaled data repositories, we have the opportunity (and obligation) to creatively reconceptualize our approach, lest we magnify the current limitations in the scholarly communication chain.

Increasingly, value resides in the relationships between researchers, papers, experimental data and the ancillary supporting materials, associated dialogue from comments and reviews, updates to the original work, etc. Typically, when hypertext browsing is used to follow links manually for subject headings, thesauri, textual concepts and categories, the user can only traverse a small portion of a large knowledge space. To manage and utilize the potentially rich and complex nodes and connections in a large knowledge system such as the distributed web, system-aided reasoning methods would be useful to suggest relevant knowledge intelligently to the user.

As our systems grow more sophisticated, we will see applications that support not just links between authors and papers but relationships between users, data and information repositories, and communities. What is required is a mechanism to enable communication across these relationships that leads to information exchange, adaptation and recombination – which, in itself, will constitute a new type of data repository. A new generation of information retrieval tools and applications is being designed that will support self-organizing knowledge on distributed networks driven by human interaction. This capability would allow a physicist or biochemist to collaborate with colleagues in the life sciences without having to learn an entirely new vocabulary.

Recent notable examples where decentralized efforts have succeeded with innovative approaches include diverse experiences such as decoding the human genome, the open source movement and peer-to-peer networks. It would be in our best long-term interests to optimize our communication systems to support a variety of approaches while we evolve our understanding of the coming adaptive web and its impact on building our data repositories that support both current and new forms of scientific communication. If we believe it is prudent to hedge our bets, many alternatives should be propagated and stimulated.