June 15 - 17, 2003   
Wequassett Inn, Cape Cod   
Chatham, Massachusetts   
NSF/JISC Workshop
 
General
Welcome
Background
Agenda
References
Important Dates
Participants List
   
OUTREACH
China 2004
Bangalore 2005
   
For Contributors
Call For Papers
Papers
Breakout Reports
Final Report
Opening Plenary Session
Supplementary Contributions
   
For Participants
Expense Form
Accommodation
Tourist info
Travel
   
Organization
Sponsors
Contacts
   

 

   
Papers  
   
Roadblocks
 
   
   

Herbert Van de Sompel, Los Alamos National Laboratory, Research Library
Many thanks to Rick Luce for sharing his insights
Download: PDF Version    WORD Version

Introduction

I have no problems with admitting that thinking about a 10-year timeframe poses significant challenges for me. Going through the simple thought experiment of moving myself back in time to 1993, and using that perspective to try and figure which components of the current information infrastructure I would have been able to predict, or which level of global penetration each of those would be able to achieve, leave me rather speechless. Yes, after having seen an early glimpse of the Web through the lens of a Mosaic browser made me understand intuitively that this was going to be important. Real important, as opposed to how, for instance, Gopher was “interesting”. But predicting that the Web along with then unknown related technologies would transform our world and world-view in general, as well as everything we did in libraries both traditional, digital and hybrid specifically was way beyond my capabilities as a futurist. Therefore, I have chosen not to directly address the future DL research agenda, but rather, I have concentrated on trying to identify roadblocks in the DL arena that I feel need to be abolished in order to pave the way for the vision expressed in the Cyber infrastructure Report. The items I identify are situated at the level of basic plumbing - things that need to be in place so that other great things can happen. Therefore, they may not come across as all that exciting or innovative, however, I feel that failing to address them may seriously impede efforts to move towards an integrated scholarly knowledge environment, as envisioned in the Cyber infrastructure Report. Most issues I identify pose enormous challenges, but they are not merely technical research challenges. Rather, they cover a broad range of research and organizational challenges at both the technical, legal, economical and social level. I present them in no particular order and hope that, when combined with ideas from other participants, the thoughts communicated hereafter will constitute a contribution to the goals of the meeting.

Some perceived roadblocks

(1) DL Theory - The field exists for over a decade, and has produced significant advances. Yet, it seems to me that further advancement may actually be hindered by the lack of an accepted base-line Theory for Digital Libraries. I am hereby thinking of a specification such as the one describing the OAIS model, which has significantly enhanced our ability to communicate about, identify research areas, and do actual research in the realm of digital preservation. Establishing such a Theory for Digital Libraries could have similar effects in our domain, and might help to escape from the inertia in research that is perceived by some observers. A concrete example might help to emphasize the issue at stake: could it be that the reason that the DL field has so far delivered only one widely agreed upon protocol (the OAI-PMH) is related to the fact that we have not yet agreed upon and named the core components of a distributed DL environment and hence are unable to start defining protocols for their interaction?

(2) DL Context - The DL field largely builds upon Web technology and infrastructure [which obviously is a sensible thing to do if only in light of sustainability], yet it does not seem to have a significant impact on their evolution. This seems to be true at the level of applications/services where very few DL research results have been able to transition to a product phase, and even less have evolved into sustainable offerings, commercial or otherwise [Google being a very prominent counter-example]. The lack of impact of the DL field also seems to exist at the level of defining essential building blocks for the evolution of the Web infrastructure. Again a concrete example may help: while the DL field agrees that unique persistent identifiers/names for objects/concepts/formats/etc. in DLs are crucial, the manner in which such identifiers can be integrated in the Web infrastructure does not seem to be informed by concerns of the DL community. I am unsure as to what the actual reasons for the perceived situation are, but I do believe the issue needs to be addressed if we want to make DLs an integral part of the Web, and the DL community a respected partner in efforts aimed at the further evolution of the Web.

A similar consideration can be made about the interaction of DL research with other domains that deal with conducting or communicating scientific endeavors using the digital network infrastructure. For instance, to which extent do relevant concepts and results from DL research penetrate into the realm of grid computing development, efforts dealing with mass storage of content/datasets, e-learning R&D, and vice versa? And how can we ensure that such efforts move in a common direction, which will eventually be essential if a fully integrated networked environment for scholarship is to emerge? As is the case with the aforementioned relationship to Web technologies, it seems essential that the DL community be an equal partner in efforts targeted at such integration.

Hence, the fundamental question is how the DL community can be moved into a position that provides adequate guarantees with respect to such interaction. While the suggestion might be totally off base, it occurs to me that in order to give the DL community a strong enough voice in such interactions, the creation of one or more centers of excellence [like the ones the NSF has created for e.g. nanotechnology] might be part of an answer.

(3) DL Ecology - The sustainability question is on everyone’s lip and it is becoming obvious that it needs answering. Attempts to answer it should not only be about the identification of new business-models, but should include a fundamental questioning of the emerging DLs ecology that organizationally, to a large extent, is a direct transposition of the geographically constrained pre-digital era: a library is an island [peninsula] that provides each and every service. Shouldn’t we question whether an ecology based on the provision of redundant services, as was necessarily deployed by traditional libraries, remains optimal in a networked environment? Could we not, for example, think of an ecology based on distributed service provision: nodes specialize in specific tasks and exchange their services for those of nodes with other specializations? The OAI-PMH model is a very humble illustration of the idea: some nodes [data-providers] focus on storing an exposing metadata, others [service-providers] focus on doing something relevant with it. Others models that are conceptually related are the LoCKSS project, the QuestionPoint Collaborative Reference Service, the Seti project, etc. These projects urge us to think about the feasibility of a different ecology, in which a global network of interacting nodes becomes a robust and persistent global DL that provides a wide range of services.

(4) DL Rights Framework - The thoughts introduced in the above section can be extended to content residing in the DL Ecology: it is high time for serious efforts aimed at deploying an environment in which scholarly assets (technical reports, preprints, research papers, datasets, simulations, metadata records, etc.) behave in a manner that more closely matches the “gift exchange” spirit of scholarship. [Keep reading, as this is not yet another “ditch Elsevier” serenade.] As James Boyle brilliantly advocated in his JCDL 2003 keynote, it is time to think about this problem from the perspective of what we actually loose by sticking with the current paradigm? That loss is far from being purely financial [although one may hardly think about the joint budgets for journal acquisitions of research libraries worldwide, and what could be done with the budget of just 1 year]. The loss is also about the enormous constraints on the allowed usage of such assets. [Again, keep reading, because this is not about the legendary professor not being allowed to make 20 copies of her own paper.] Rather this is about the inability to freely process scientific assets in order to extract knowledge from them, attach knowledge to them, mine them, evolve them, build on them, etc. This is also about the complexities that we are forced to build into our systems in order to accommodate the existing rights framework. We need to think beyond research papers [although the situation is severe enough there], and anticipate that datasets, simulations etc. are endangered to go down the same path, and actually in some lucrative disciplines already are. Considering that we cannot change the past, we must understand that it is never too late for a new start, especially in light of the enormous annual growth of scholarly assets. .

(5) Binding Scholarly Assets - I perceive a serious shortcoming in the existing scholarly communication mechanism, which I need to explain by a very simple example:

At a certain point, a scholarly paper makes its public appearance in the system as an electronic preprint. Next, it gets peer-reviewed and published in a journal. Then some A&I database providers publish a metadata record describing the paper. Some scholars read the paper, build on it and hence cite it.

Unfortunately, the scholarly system does not record an unambiguous trace of these actions nor of their nature. This is actually true of most value chains that scholarly assets go through: there is no unambiguous, recorded and visible trace of the evolution of a scholarly asset through the system, nor of the nature of the evolution.

The results of this are quite disturbing, and I will illustrate a specific instance of the problem using the simple example:

Services need to go through enormous pains to computationally derive relationships between the preprint, the journal publication, the metadata records, and the citations after the facts. For example, in order to understand that the publishers’ metadata, the A&I metadata, and the metadata contained in the citations refer to the same work, complex, and computation-expensive processes need to be run, which actually rarely lead to perfect results, and which only a few services can actually perform because of the access rights required to do so.

Through the above example, the problem can be misread to be one of computing power, algorithms and access rights. It actually is not. The problem is that relationships, which are known at the moment a scholarly asset goes through a step in a value chain, are lost the immediate moment thereafter, and in many cases are forever lost. The actual dynamics of scholarship, the interaction/connection between assets, authors, readers, quality assessments about assets, scholarly research areas, etc. are lost and are extremely hard to recover after the facts. A whole body of information that could be crucial for the development of knowledge is lost. Therefore, the establishment of a layer underlying scholarly communication – a Grid for scholarly communication - that records and exposes such dynamics, relationships, and interactions is desired. In that Grid, many types of arcs expressing many different value-chain-based relationships interconnect assets and actors. Such a Grid would be instrumental in the derivation of knowledge and understanding from the scholarly assets it binds. It would provide the essential infrastructure required for the extraction of new metrics to assess the quality of scholarly assets, their appropriateness in a given research context, and for the evaluation of the performance of actors in the scholarly system. Such metrics are crucial to avoid information-overload and to pave the way for a different scholarly communication system.

The citing author drags an icon representing the cited paper to his own draft; the author’s software records metadata about the cited paper, and its “cited by” relationship with the draft and its author. The citing author posts his paper to the institutional repository, which extracts and exposes all contained metadata, both descriptive and relational. A robot operated by a publisher checks the institutional repository for additions, and selects the preprint to be considered for peer-review and publication. Around the same time, another robot operated by one of the organizations that deliver citation-based services harvests the exposed citation information, and adds it to its collection, keeping track of its origins. Another robot comes by to collect descriptive metadata to add to its scientific altering service. As alerted scholars start reading the preprint, the institutional repository records and exposes the usage – a (probably anonymous) “read by” relationship. A service that provides several quality metrics about scholarly papers collects the usage information. An archiving robot comes by to collect the preprint and all data related to it, archives it, and exposes the fact that it did so to the Grid …