June 15 - 17, 2003   
Wequassett Inn, Cape Cod   
Chatham, Massachusetts   
NSF/JISC Workshop
 
Hand-written Materials and the Science of Information Management
 
   
   

Kevin Kiernan, University of Kentucky

Although there are some signs of change in the air, it is fair to say that most information and computer scientists see little benefit in including humanities scholars in the post-digital library research agendas addressing such momentous topics as homeland security. In this conception, interdisciplinary work applies only to scientific disciplines. It is a dangerous mistake to conceive of knowledge as exclusively scientific and of digital libraries as mere repositories of knowledge. The disciplines whose expertise encompasses language and literature, history and philosophy and religion, film, art and music, cultural studies, psychology and political science and sociology, anthropology, geography, architecture and archeology, can even make critical contributions to homeland security.

Because they work closely with the relevant cultural and historical sources of digital images, scholars in the humanities know perhaps better than scientists that automated searches of raw digital images are never going to find most of what is important in them. Humanities scholars also know that the automatic insertion of searchable metadata is an inadequate measure for providing comprehensive access to the knowledge that digital images potentially hold. Although there are many kinds of metadata, for humanities scholars the most important will be those that incorporate translation, interpretation, analysis, and criticism, the digital library equivalents of the books and articles written about primary sources in traditional libraries.

Just as traditional libraries are not limited to primary sources, so digital libraries should not be limited to digital images of primary data. The important difference is that, if provided with proactive interfaces, scholars could continually enhance the primary data of digital libraries with searchable metadata of essential secondary information. This explanatory information would always be linked to the primary digital sources and always available for accurate, comprehensive, instantaneous searching. These fundamental observations apply to all digital resources, but they are most obviously true of the millions of heterogeneous hand-written materials on stone, clay, cloth, wood, canvas, papyrus, animal skin, paper, walls, or any other medium carrying text of any kind. A transcription of the text on these objects is metadata, which to some extent makes the metadata the primary data. Without the metadata, the data are in many cases (e.g., ancient texts, metaphorical language, unfamiliar scripts, and foreign languages) virtually meaningless. The humanities discipline that investigates, analyzes, and edits these textual materials is called textual criticism.

The importance to our national security of incorporating the methods of textual criticism into the emerging science of information management can perhaps be illustrated by imagining the discovery of a significant cache of al-Qaeda documents in Upstate New York. The collection includes a training video, forged passports and currency, birth records and drivers’ licenses; a copy of a non-Egyptian edition of the Qur’an of unknown date and provenance; two prayer mats (one Pakistani and one reed, of undetermined origin); and a terrorist manual, hand-written by at least two and possibly three people with different inks in variant styles of Arabic script. In addition to prose, the text frequently uses Qur’anic verses and passages of poetry. The handwriting and demotic of one of the writers, who made erasures, corrections, and revisions, are eccentric. The previous owners left in haste, but tried to destroy the most important sections of the manual (mixed perhaps with other burned documents) by tearing out pages and setting them on fire. Fortunately, significant fragments of text, though now deprived of their full context, and distorted by scorching, curling, and shrinking, survive and if restored can furnish valuable information. A burnt, hand-painted emblem or insignia from the cover, with an illegible inscription, depicts a globe pierced by a scimitar, highlighting the Middle East and Africa. Parts of the manual employ encryption; parts give lists of chemicals; parts make very specific but obscure cultural, historical, theological, and geographical references; parts use figurative language in alluding to a wide range of possible targets. Throughout, the tone is highly charged, exuding sarcasm, rage, and fanatical zeal.

A collection of this nature is hardly far-fetched, and in fact some have already been found. They should all be part of a distributed digital library. What kind of information environment, or scholarly, investigative, analytical, communication infrastructure, is required to access these highly heterogeneous materials? The highest level of users will be experts in diverse areas of the sciences and the humanities, including Arabic language and literature; Arabic script; art history; chemistry; counterfeit currency and forgeries; criminal investigation and profiling; dialectology; digital imaging; film studies; formal and demotic Middle Eastern languages; geography; handwriting analysis; Islamic history, culture, poetry, and religion; textiles; library and information science; linguistics; literary theory and analysis; quantum mechanics; physics; plant biology; psychology; public records; and textual scholarship.

The heterogeneous group must have a common, shareable infrastructure, so that users of PCs and Macs will have access to Unix resources, and must establish a common software architecture, so that essential tools may be developed and actually used by all members of the team. To make steady progress in the investigation, concurrency control must permit all contributors to access the database or relevant parts of it simultaneously, annotate it with expert metadata, and update it without overwriting or otherwise interfering with the ongoing annotations of other expert contributors. Many of the most critical contributors, for example, the experts in dialectology or poetry, may have few or no computer skills, and may therefore have an initial aversion to using computers at all. They will require simple graphical user interfaces to translate (and transparently encode) every word in the terrorist manual, and permanently link the translation with the part of the image on which it is based, for reference by all other members of the team. Two or more specialists in Arabic language should be able to gloss the same parts of the manual at the same time and, by means of the accruing glossarial metadata, access each other’s possibly conflicting linguistic annotations in real time, along with everyone else working on the project. Other hierarchies of metadata must not conflict with one another. The Arabic script of the images will run from right to left, while translations run from left to right. The text must be searchable from text or image, which are inextricably linked through the annotating software.
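The requirements in the preceding paragraph (glosses permanently linked to a part of an image, several experts annotating the same word without overwriting one another, and searching across the accrued metadata) can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the classes Region, Gloss, and AnnotationStore, the pixel-rectangle scheme, and the sample identifiers are hypothetical and are not drawn from any existing tool. The sketch avoids overwriting simply by being append-only within a single process; a real shared system would also need locking or versioning across machines.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """Rectangle on a page image, in pixels (hypothetical coordinate scheme)."""
    image_id: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Gloss:
    """One expert's annotation of one image region."""
    region: Region
    author: str
    source_text: str        # transcription, stored in reading order
    translation: str        # left-to-right translation
    direction: str = "rtl"  # writing direction of the source script


@dataclass
class AnnotationStore:
    """Append-only store: glosses are added, never overwritten."""
    _by_region: Dict[Region, List[Gloss]] = field(default_factory=dict)

    def add(self, gloss: Gloss) -> None:
        self._by_region.setdefault(gloss.region, []).append(gloss)

    def glosses_for(self, region: Region) -> List[Gloss]:
        return list(self._by_region.get(region, []))

    def conflicts(self) -> Dict[Region, List[Gloss]]:
        """Regions whose glosses disagree on the translation."""
        return {r: gs for r, gs in self._by_region.items()
                if len({g.translation for g in gs}) > 1}

    def search(self, term: str) -> List[Gloss]:
        """Glosses whose transcription or translation contains the term."""
        t = term.lower()
        return [g for gs in self._by_region.values() for g in gs
                if t in g.translation.lower() or t in g.source_text.lower()]


# Two specialists gloss the same word on the same page image.
store = AnnotationStore()
word = Region("manual_p12", 840, 120, 96, 40)
store.add(Gloss(word, "linguist_a", "كتاب", "book"))
store.add(Gloss(word, "linguist_b", "كتاب", "writing"))

for gloss in store.glosses_for(word):
    print(gloss.author, gloss.translation)
```

Because every gloss carries its region, a search hit leads back to the exact part of the image, and the conflicts method shows the team where expert readings diverge instead of letting the later reading replace the earlier one.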

Investigators and researchers will want to analyze not only the material at hand, but also any other pertinent digital collections relating to it. For instance, they may be able to reconstruct lost parts of the manual by collating it with similar manuals found in Karachi, Pakistan; Frankfurt, Germany; and Manchester, England. Using the preserved letters as models, they may be able to reconstruct fragmentary letters that were shredded, burnt, or otherwise damaged, or they may be able to locate other writing by the same people in other distributed collections. The images from the video must also be easily accessible, for some bearded faces from the video appear to resemble the shaven ones in photographs on the passports and drivers’ licenses. Easy-to-use tools must be available for overlaying images and rapidly exploring distributed databases of faces and any facts already assembled elsewhere about them.
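Collation of a damaged text against a fuller witness, as described above, can be illustrated with a toy example built on Python's standard difflib module. The gap marker, the function name propose_restorations, and the English sample sentences are all hypothetical stand-ins. Real collation of hand-written witnesses would have to cope with spelling variants, reordering, and partial letters, which this sketch does not attempt.

```python
from difflib import SequenceMatcher
from typing import List

GAP = "[...]"  # hypothetical marker for a destroyed or illegible word


def propose_restorations(damaged: List[str], witness: List[str]) -> List[str]:
    """Align two word lists and fill runs of GAP from the witness.

    Proposed words are wrapped in angle brackets so that an editor can
    tell a restoration from a reading that actually survives.
    """
    matcher = SequenceMatcher(a=damaged, b=witness, autojunk=False)
    restored: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        segment = damaged[i1:i2]
        if tag == "replace" and all(w == GAP for w in segment):
            # The damaged text has only gaps here; borrow the witness words.
            restored.extend(f"<{w}>" for w in witness[j1:j2])
        else:
            # Keep what survives; never add or drop words silently.
            restored.extend(segment)
    return restored


damaged = "store the [...] [...] in a dry place".split()
witness = "store the sealed containers in a dry place".split()
print(" ".join(propose_restorations(damaged, witness)))
```

The bracketed output keeps conjecture visibly separate from surviving text, which is the textual critic's usual practice when restoring a damaged passage.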

In the digitization program it will be important to adhere to cataloguing conventions of a metadata standard such as the Dublin Core (title, creator, subject, description, etc.) for organizing the materials. It is critical to digitize everything in the highest possible resolution with a professional digital camera enabled with software (portable to most commercial cameras) that inserts technological metadata of the lighting conditions, the distance from the camera, the exact sizes of the source artifacts, and all pertinent features of the digital file. The extremely high resolution, and the use of special lighting sources (e.g., ultraviolet, infrared, x-ray) will make possible sophisticated image-processing of the burnt, fragmentary, illegible passages, the restoration of erased text, minute analyses of the handwriting, the detection of watermarks in the paper, and other features requiring microscopic analysis. Some of the materials, such as the curled pages and the prayer mats, will also benefit from 3D imaging, algorithms for flattening the curled pages, reversing shrinking and stretching from fire-damage, compiling fragmentary letters and using pattern recognition to identify and restore them, examining the weave and colors of the carpet, and so on.
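A record that combines descriptive Dublin Core elements with the technical capture metadata mentioned above might look like the following sketch. The four descriptive fields named in the text (title, creator, subject, description) are Dublin Core elements, as are language and identifier. The CaptureMetadata class, its field names, the "capture:" prefix, and the sample values are assumptions made for illustration and do not follow any particular camera or imaging standard.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class DublinCore:
    """A subset of the Dublin Core descriptive elements."""
    title: str
    creator: str
    subject: str
    description: str
    language: str = ""
    identifier: str = ""


@dataclass
class CaptureMetadata:
    """Technical metadata recorded at imaging time (hypothetical fields)."""
    lighting: str              # e.g. "visible", "ultraviolet", "infrared"
    camera_distance_mm: float
    artifact_width_mm: float
    artifact_height_mm: float
    pixel_width: int
    pixel_height: int

    def pixels_per_mm(self) -> float:
        """Horizontal sampling density of the image on the artifact."""
        return self.pixel_width / self.artifact_width_mm


def to_record(dc: DublinCore, capture: CaptureMetadata) -> Dict[str, object]:
    """Flatten both kinds of metadata into one prefixed dictionary."""
    record: Dict[str, object] = {
        f"dc:{k}": v for k, v in asdict(dc).items() if v  # skip empty elements
    }
    record.update({f"capture:{k}": v for k, v in asdict(capture).items()})
    return record


dc = DublinCore("Manual, page 12", "unknown", "handwriting", "Scorched leaf")
capture = CaptureMetadata("ultraviolet", 500.0, 200.0, 300.0, 8000, 12000)
record = to_record(dc, capture)
print(record["dc:title"], capture.pixels_per_mm())
```

Recording the physical size of the artifact alongside the pixel dimensions lets later users recover real-world scale from the image, which matters for handwriting analysis and for comparing fragments photographed at different distances.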

Often focussing on theoretical solutions and generic proof-of-concept issues, most computer scientists have not considered the creation of practical, cross-disciplinary infrastructures, architectures, and software tools as a legitimate area of research. But in order to achieve a truly powerful, ubiquitous, interoperable, easily reconfigurable information environment for all levels of users across all disciplines, the creation of a common, collaborative workspace is absolutely essential. Most of the investigators who have the most to contribute in terms of knowledge of the data have the least technical expertise to build the information infrastructure. The computer scientists and their students, however, cannot build it without the close collaboration of the heterogeneous domain experts who will use it. The rewards for everyone of this shared information environment would be great, and should be demonstrable at an early stage of development.

To make real headway in this emerging new area of collaborative research it is accordingly essential to overcome the modern segregation of science and humanities to forge new and enduring working environments for research, teaching, and learning across the disciplines. Some of the most difficult, interesting, and practical research problems in computer science can be addressed through serious engagements with broad humanities disciplines such as architecture, art and archeology, film, history, language and literature, and music. The structure of modern social systems, in particular universities, with their cleavages between engineering and humanities disciplines, does not make it easy to work together. Funding agencies that recognize the importance of forging new academic alliances that cross disciplines can do much to encourage the development of cross-disciplinary and interdisciplinary undergraduate and graduate curricula, leading to eclectic degrees in both computer science and humanities disciplines.

Libraries of the future must play a critical role, as well. Digital resources must be designed from the outset for the highest level of use, if they are intended to accommodate all levels of users. For this reason, the grand goal of the new science of information management must be more ambitious than the very important but basic one of organizing, storing, and managing endlessly growing amounts of digital images for universal remote access. The most useful digital resources must devise efficient and robust ways of continually incorporating new knowledge, linking text to related images to enable continued research, teaching, and learning, and ever-richer searching of the primary data.
While they may at first seem to be an unduly narrow testbed, hand-written materials in fact serve as a highly effective generic model to represent a multilingual, multicultural, multimedia set of problems.

