June 15 - 17, 2003   
Wequassett Inn, Cape Cod   
Chatham, Massachusetts   
NSF/JISC Workshop
 
General
Welcome
Background
Agenda
References
Important Dates
Participants List
   
OUTREACH
China 2004
Bangalore 2005
   
For Contributors
Call For Papers
Papers
Breakout Reports
Final Report
Opening Plenary Session
Supplementary Contributions
   
For Participants
Expense Form
Accommodation
Tourist info
Travel
   
Organization
Sponsors
Contacts
   

 

   
Papers  
   
Research Issues for Digital Libraries
 
   
   

David A. Forsyth, Robert Wilensky, UC Berkeley
Download: PDF Version    WORD Version

We offer a number of research topics that we feel are worthy of attention. Of course, many other problems, notably, preservation, data integrity and data provenance, are far from being solved, but believe these are at least better recognized. We focus here on some problems that we feel have received insufficient attention thus far.

Living Without Metadata

Attempts to produce large, useful digital libraries run into key problems associated with metadata. Typically, objects are either missing key metadata or the metadata is difficult to interpret—it has been supplied according to a different schema or the interpretation with respect to a schema is unclear. We note that the most successful tools, e.g., web indices like Google, work as well as they do precisely because they do not rely on any metadata whatsoever. Of course, this approach undeniably limits usefulness.

We see two important possible research directions arising.

The first is building an understanding of methods to translate metadata between schemas. Current strategies for manual translation founder in a morass of detail because they aim for accuracy. However, an inaccurate system may still be useful, particularly when there is no practical alternative, particularly when the volume of data is very large, such as when one is attempting to federate a set of collections. These observations suggest studying various statistical methods to build lexicon-like structures that can automatically align different types of metadata annotation.

The second is to explore various methods for obtaining metadata that are absent from an existing digital object. The current approach—in which skilled personnel attach metadata to objects—is infeasible for large collections. The first natural strategy to explore involves using existing, labeled collections to build classifiers that can attach tags to unlabelled collections. This approach is viable, because there are many large collections that are (partially) labeled, sufficient that it is practical to learn and evaluate classifiers for many kinds of metadata tag.

This class of strategies is likely to be successful only for metadata that can be inferred from the object itself. For example, it may be possible to determine the names of those present from a picture, but it is likely to be impossible to determine the time and data at which the picture was taken. One can attempt to deal with this difficulty by attempting to compel creators or editors of objects to attach metadata at the time of creation or editing, but in practice, this very likely will not happen. Instead, we propose to explore lightweight metadata collection. Objects should become “sticky"; when an object is created, edited, downloaded, read, etc., it should opportunistically collect various forms of information that might be helpful in building classifiers later. This is a generalization of the approach where current digital cameras insert a date and time into picture header files; in the not-too-distant future, we can expect to find GPS information there, too.

Content Analysis for Collection and Meta-Collection Organization

The issue of content analysis of non-textual objects is well recognized, albeit still immature. However, while the issue of content analysis of images and the like for search has been the subject of substantial investigation, content analysis for organization of collections has received much less attention. We feel that this allocation of research resources is misguided, at least in the case of image collections. This is because, first, there is very little prospect of supplying search tools that can do what users of search want, and second, there are a number of other interesting activities, including, but not restricted to, browsing, organizing and data mining, that are more likely to be successful using the technology available in the foreseeable future. These activities have not been sufficiently well studied.

The reason for our caution concerning searching non-textual collection is that, according to what studies we have, users typically query images on semantics. For example, Enser’s study (Enser 2000) reveals users requesting from the Hulton-Deutsch collection (now the Hulton-Getty collection) images of hangovers, physicists and the smoking of kippers. Identifying any of these concepts is well beyond the reach of current image analysis. Enser cites as a particularly compelling example a request to a stock photo library for “Pretty girl doing something active, sporty in a summery setting, beach - not wearing lycra, exercise clothes - more relaxed in tee-shirt. Feature is about deodorant so girl should look active – not sweaty but happy, healthy, carefree - nothing too posed or set up – nice and natural looking''. Our own experience with users of image databases is, unfortunately, quite similar to this example.

Being able to perform such queries would be valuable, but difficult. However, there are studies that show being able to browse such collections is also of great value, and some reasons to believe that organization for browsing may be tractable. In addition to using metadata or textual data, one might attempt to cluster on a combination of image and text data, as Barnard et al. (2001) do. No option is clearly best at the moment, but some form of structure appears to be required, and it is at least plausible that such structures can be automatically inferred in the near future. In addition, there are many issues of mechanics, of how one must transmit, layout and display images efficiently, as well as choose what to display, that must be addressed.

There are two difficulties in organizing collections of visual material: First, it is difficult to obtain representations of what is depicted. Second, once one has this representation, organizing the material can be tricky. For example, one might be able to browse a museum collection that had been laid out in a manner that "made sense" to the user. In other applications, the organization would be relatively simple if the content could be obtained. For example, in military intelligence applications one might find imagery that showed large recent changes in where forces are concentrated. Similarly, one might examine a large reference collection of digitized images of Buddhist art for trends in the depiction of the human figure across space and time. These examples all depict uses of image collections that are not primarily about image search.

Data mining image collections is another example of a useful organization of collections. For example, it may be possible to build catalogs of objects, such as collections of images of a given individual, by automatic means. The number of applications of such derivative products is considerable.

Collection Characterization and Federation

Content analysis for organization related to our previous point. It may facilitate the federation of collections along lines that haven’t been anticipated by the collections’ respective creators. Moreover, it is related to another topic that has received scant attention, the automatic characterization of collections. Prior work has focused on the suitability of a collection for the purpose of satisfying a textual query. This avenue of work has not been entirely satisfying, largely to the success with which union searches can be carried out. However, there is a large set of possible tasks for which characterizing collections might prove useful. For example, if one’s goal is to find a good picture of George W. Bush, the image collection of the Fine Arts Museum of San Francisco is a poor place to look, whereas a collection of news articles is likely to be of greater utility. Being able to characterize diverse collections for the suitability of particular goals would also be especially useful when the process to be carried out is computationally expensive, and hence applying a process to a union catalog is infeasible.

Being able to characterize diverse collections for the suitability of a variety of goals appears challenging, but has not been investigated to any great extent. Hence we recommend doing so.

In sum, we suggest focusing on the technology to support content analysis for organization of non-textual collections.

Information Lifecycle Models

The shift to digitized media has thus far still largely left unchanged our traditional models of information use. Individual scholars still submit papers to conferences and journals, which still review them in the traditional manner, perhaps making the product available digitally. Our view is that most of the benefits of technology change will not be realized unless publishing, i.e., peer review, dissemination, and the use of primary and secondary data sources, is fundamentally changed from its current, centralized, linear, binary, expensive, “filter-then-disseminate” model, to a much less costly, powerful, fully distributed “disseminate-filter-collaborate” cycle, without sacrificing good organization, peer review and the like.

Progress in this direction has been slow, although we believe the realization of the importance of such a transition is clear. Progress could be accelerated by exploring further alternative models for content evaluation and filtering (to replace and perhaps majorize peer review), for distributed annotation (to support collaboration), and for distributed organization and manipulation of information substrates (to support individual and groups imposing new organizations on available material).

References

Barnard, K. and Forsyth, D., 2001. Learning the Semantics of Words and Pictures, International
Conference on Computer Vision, pp. 408­415.

Enser, P.G.B., 2000. Visual image retrieval: seeking the alliance of concept based and content based paradigms. Journal of Information Science, vol. 26, no. 4, pp 199-210.