June 15 - 17, 2003   
Wequassett Inn, Cape Cod   
Chatham, Massachusetts   
NSF/JISC Workshop
 
General
Welcome
Background
Agenda
References
Important Dates
Participants List
   
OUTREACH
China 2004
Bangalore 2005
   
For Contributors
Call For Papers
Papers
Breakout Reports
Final Report
Opening Plenary Session
Supplementary Contributions
   
For Participants
Expense Form
Accommodation
Tourist info
Travel
   
Organization
Sponsors
Contacts
   

 

   
Papers  
   
NSF DL Position Paper
 
   
   

Carl Lagoze, Cornell University
Download: PDF Version    WORD Version

The roots of the NSF digital library initiative lie in the pre-web era. An unfortunate result of this has been an artificial separation between web research and digital library research. I believe that this separation has played out to the disadvantage of the digital library community. This is ironic on two counts. First, one of the most fundamental advances in the Web, page rank that led to the Google phenomenon, is a result of NSF DLI funding. Second, the vast majority of digital library work has been enabled advances in web technologies – Java, XML, RDF – yet almost none of these technologies are the result of DLI projects.

The cost of this separation is evident in the quality of the respective research in the fields. There is, in my opinion, a disturbing imbalance in the strength of the two areas as demonstrated by the quality of their major conferences. Over the years the quality of papers at the World Wide Web Conference have increased dramatically, with exciting results in fundamental areas such web characterization, modeling, and link analysis. In contrast, in my six years as program committee for the major digital library conferences – ECDL and JCDL – I have see a disturbing decline in paper quality and a notable shortage of fresh new ideas and research directions. Results have been largely incremental with few signs of exciting new threads developing.

I propose in this paper that future funding for research in network-based information must address our current lack of basic understanding of the web; how it is structured, how people use it, how it affects social interaction, how it grows over time, and multiple other questions. I believe that our challenge is to learn how to exploit the ubiquity and phenomenal richness of the web environment and layer upon it the high-integrity contexts that we endow with the name ‘libraries’. Formal understanding the web space is essential for building scalable, economical, and sustainable digital libraries.

Before proceeding, I should more carefully define what I mean by ‘the web’. At one level the web is a set of protocols and languages – in particular HTTP, URLs, and HTML – that are in fact ephemeral. In this sense, the web is not something that will last forever and will surely be supplanted by future technologies. Yet at another level the web is a truly metaphysical phenomenon, an interlinked knowledge network independent of the technology. In this sense one can argue that the web has always existed because humans from pre-history have formed social networks that shared information and referred to each other and their concepts. The instantiation of this social web as a technical reality, the World Wide Web, marks an event heretofore unseen in history. That is, our inherent social and knowledge networks can now be accessed, monitored, and analyzed and, with the help of massive computation power, perhaps understood. A plausible, albeit possibly controversial hypothesis, is that by studying and understanding the web we can better understand social structures and the organization, flow, and development of knowledge. This is the web, that I propose as a research focus, rather than the purely technical web. In the remainder of this paper I enumerate a number of future research areas within this broad umbrella.

Many parts of this short paper derive from a recently submitted NSF ITR proposal submitted by Cornell and other institutions. As such, I owe credit to colleagues including Bill Arms, Jon Kleinberg, Paul Ginsparg, and Dan Huttenlocher at Cornell, Raimie Stata at UC Santa Cruz, Lee Giles at Penn State, and Brewster Kahle at the Internet Archive.

Understanding the structure of the web

Recent research has produced intriguing results related to the structure of the web including some understanding of its linkage patterns, the multiple communities represented in it, and its size and diameter. Formal understanding and modeling of these characteristics is essential to provide the baseline information for analyses described later in this paper. Without some notion of ‘normalcy’ in the web, it is impossible to recognize perturbations, spikes, and notable patterns that evidence something semantically meaningful or that can be classified as knowledge rather than as simple data.

Initial results in web characterization merely scratch the surface and the rapidly changing nature of the web calls out the need for a significant research effort in this area. Due to its size alone, the web challenges our current hardware capacities, mathematical models, and computational techniques. “Web-sized” experiments are difficult to undertake since relatively basic tools such as web crawling is still understand at only a basic level. These attempts to analyze the web demonstrate the need for research in improved graph modeling techniques and understanding of random effects in large-scale systems.

The base model for our understanding the web – as a graph where nodes are resources and edges are its hyperlinks – is proving too simplistic. For example, it is well-known that not all edges in the graph are equal. Factors such as nepotistic links, whereby sites are self-referential, are known to interfere with graph-based analysis techniques such as page rank. Furthermore, there is well-known but marginally understood polarity in links which might indicate recommendations or condemnations with varying degrees of strength – e.g., a scholarly citation to a ‘particularly good example’ in contrast to a paper that the citer strongly disagrees with. There are open research issues in natural language understanding of link anchor text and how to apply link polarity and its degree to basic web models and ranking techniques.

Another shortcoming in existing web models lies in the simplistic view of a node in the graph. Most work assumes that a node is a ‘document’ accessible via a URL. However, the modern web is far more complex that static HTML documents and in fact behind URLs lays complex and dynamic content and the large databases and repositories that make up the hidden web. The problems with incorporating these complex objects into a web model are similar to those that accompanied extending Garfield’s initial citation analysis work from journal influence to individual paper influence. If we intend to develop tools to automatically analyze the web we must develop models that disaggregate these complex nodes in the graph or aggregate them when necessary (i.e., into strongly connected components).

Understanding resource quality

Using the web as the basis for a managed information environment immediately introduces questions about resource quality. Our work in the NSDL has confronted these questions since we not only want to build a large-scale digital library but one where users can filter the library at varying quality thresholds. While quality assessment is certainly possible with human mediation, such mediation is economically infeasible at the scale of the information spaces we wish to build. Consequently, automated quality mechanisms are a fundamental area in web-based information research.

There is considerable initial work in this area. Research and development of annotation systems provide the basis for tools where people and agents may impose quality assessments on web resources and portions thereof. The utility of these annotations beyond simple display for human perusal remains to be studied and developed. Link-based search algorithms such as page rank and HITS attempt to infer quality by recursively quantifying links. Research leading to better understanding of the structure underlying the links, both in terms of node granularity and edge polarity, may improve these automatic quality assessments. Finally, important open questions remain about reputation systems that facilitate ‘reviewing the reviewers’. While a number of algorithms exist to assess reputation, they have proven fragile to intentional attacks and need to be improved to be a reliable basis for automated quality assessment.

The Dimension of Time

Certainly one of the interesting distinctions between bricks and mortar libraries and networked information systems is the dynamic nature of the objects and collections. Digital objects and collections are inherently mutable and there is a need for substantial work on the time dimension in digital information. This work lies in two main areas.

There is the need for work on the time dimension at the individual object level. Besides quality, discussed in the previous section, the two most critical integrity issues with web-based information are provenance – where does an object come from and how has it changed (been tampered with) over time – and persistence – how long will an object last and how to make it last longer. Recent DLI2 work provides some interesting foundation work in this area. This includes data provenance work under Peter Buneman at University of Pennsylvania and our own work on event-based knowledge models (the ABC model) and digital object risk management. The first two results focus on models for recording changes in digital objects and databases over time, while the latter aims to develop techniques for automatically monitoring web resources (sites and pages) and enforce formally expressed preservation policies. At present results from these projects are still preliminary and focused research initiatives are necessary to move them to the state where practical and reliable digital libraries such as the NSDL can be built on top of the web layer.

Perhaps more intriguing is research that addresses the flow and growth of information on the web. If, as I proposed earlier, the web indeed increasingly provides a mirror of the broad scope of human activity (cultural, social, economic, legal, intellectual), then research results in the ‘the time axis of the web’ may give us important tools for understanding past events (e.g., the development of an important intellectual idea) or even predicting future events (e.g., a potential terrorist attack). Projects like the Internet Archive and Wayback Machine lay some basis for this work by taking periodic snapshots of the web as a whole and especially high-frequency snapshots of focused sites during special events – presidential elections and September 11. In addition, Jon Kleinberg’s work on text-based burst analysis has proven effective in a number of contexts including news burst recognition (www.daypop.com) and research trend recognition (in the context of arXiv). But this work needs to be moved up to web scale and into real time (in order to monitor events as they are happening), and herein there is opportunity for considerable research. One area of particular interest is synchronization of data from multiple resources. As pointed out earlier, underlying the web as a simple document structure are databases, news feeds, weblogs, etc. Thus, burst analysis is more than simple text basis but the recognition and relation of heterogeneous events in distinct data types. Doing this at a very large scale and rapidly will require powerful new formalisms and algorithms.

Understanding and adopting to the user

An increasing number of commercial web sites increase their level of service by profiling users, manually and automatically, and adapting to their behaviors. Any user of Amazon will notice the sensitivity of search results to both global user behavior (what is popular at the global level) and individual user behavior (what have I ordered in the past). Techniques for understanding how to effectively exploit user behavior are still at their initial stages and need to be generalized at the level of academic research rather than simply as a commercial activity. Of particular interest are the means of doing this without compromising user privacy.

Automatic Classification and Organization of Web Information

At the heart of any library problem is the need to organize information for diverse purposes such as management, discovery, and browsing. Perhaps due to inertia from the library cataloging tradition, there has been far too much attention to metadata (in its human-generated form) in digital library activities. A good bit of this work is motivated by dubious or outdated assumptions (metadata is critical to resource discovery) and the results have disappointing (the quality of non-professional produced metadata is degraded to the point of non-utility) and non-scalable (human cataloging even in its most simple form simply doesn’t scale up to the web).

In response to this reality we have seen increasing interest in both automatic metadata generation and collection aggregation. Liz Liddy’s work on ‘breaking the metadata bottleneck’ is an excellent example of the former and my own group has produced notable results in automatic collection aggregation and organization. Both areas deserve substantial greater research attention, and can especially benefit from increased understanding of basic web structure as described at the beginning of this paper.

Many of the techniques for automatic metadata generation have heretofore been restricted to analysis of a textual object and inferences from its contents. The success of Google demonstrates the importance of contextual information (i.e. link structure) for ranking search results and projects such as Google image search and the CLiMB project at Columbia show how context can provide ‘aboutness’ information relative to an object. It appears that there is substantial opportunity to improve that state-of-the-art of automated metadata generation (and thereby break through a major scalability issue in information management) by understanding how to fully exploit information context for generation of descriptive information. A special benefit of this work would be descriptive metadata generation for non-textual objects. This work is related to the efforts to understand ‘communities’ in web space. These communities have multiple motivations, but the community I am proposing here is effectively an ‘aboutness community’ – the set of web sites graphically neighboring a selected web site that offer descriptive information about the respective site.

Moving beyond the object level, libraries consist of collections and users of information (both managers and consumers) are often more comfortable in dealing with information in the aggregate rather than at the finest granularity. Research work on automatic inference of these aggregations on the web can make an important contribution to our ability to make the web a more effective knowledge environment. Even the most commonly used information aggregation on the web – the notion of a ‘site’ – is poorly understood. Work to automatically discover more abstract aggregations, such as groupings into semantically-based collections or equivalence classes (where equivalence is more complex than simple bit equality) has shown some progress but needs considerable more focus before it can be effectively deployed in real information systems.

Conclusions

As stated at the beginning of this paper, digital library research has an unfortunate distinction from web research. In fact, there has been tremendous and exciting progress in the creation of a ubiquitous knowledge environment in web space that has progressed independent of DLI-funded research. The opportunity now exists for a refocus of NSF funding in this area and make substantial progress in our understanding of the web phenomenon. I look forward for discussions at the workshop on this issue.