Kevin Kiernan, University of Kentucky
Although there are some signs of change in the air, it is
fair to say that most information and computer scientists see
little benefit in including humanities scholars in the post-digital
library research agendas addressing such momentous topics as
homeland security. In this conception, interdisciplinary work
applies only to scientific disciplines. It is a dangerous mistake
to conceive of knowledge as exclusively scientific and of digital
libraries as mere repositories of knowledge. The disciplines
whose expertise encompasses language and literature, history
and philosophy and religion, film, art and music, cultural
studies, psychology and political science and sociology, anthropology,
geography, architecture and archeology, can even make critical
contributions to homeland security.
Because they work closely with the relevant cultural and historical
sources of digital images, scholars in the humanities know
perhaps better than scientists that automated searches of raw
digital images are never going to find most of what is important
in them. Humanities scholars also know that the automatic insertion
of searchable metadata is an inadequate measure for providing
comprehensive access to the knowledge that digital images potentially
hold. Although there are many kinds of metadata, for humanities
scholars the most important will be those that incorporate
translation, interpretation, analysis, and criticism, the digital
library equivalents of the books and articles written about
primary sources in traditional libraries.
Just as traditional libraries are not limited to primary sources,
so digital libraries should not be limited to digital images
of primary data. The important difference is that, if provided
with proactive interfaces, scholars could continually enhance
the primary data of digital libraries with searchable metadata
of essential secondary information. This explanatory information
would always be linked to the primary digital sources and always
available for accurate, comprehensive, instantaneous searching.
These fundamental observations apply to all digital resources,
but they are most obviously true of the millions of heterogeneous
hand-written materials on stone, clay, cloth, wood, canvas,
papyrus, animal skin, paper, walls, or any other medium carrying
text of any kind. A transcription of the text on these objects
is metadata, which to some extent makes the metadata the primary
data. Without the metadata, the data are in many cases (e.g.,
ancient texts, metaphorical language, unfamiliar scripts, and
foreign languages) virtually meaningless. The humanities discipline
that investigates, analyzes, and edits these textual materials
is called textual criticism.
The importance to our national security of incorporating the
methods of textual criticism into the emerging science of information
management can perhaps be illustrated by imagining the discovery
of a significant cache of al-Qaeda documents in Upstate New
York. The collection includes a training video, forged passports
and currency, birth records and drivers’ licenses; a
copy of a non-Egyptian edition of the Qur’an of unknown
date and provenance; two prayer mats (one Pakistani and one
reed, of undetermined origin); and a terrorist manual, hand-written
by at least two and possibly three people with different inks
in variant styles of Arabic script. In addition to prose, the
text frequently uses Qur’anic verses and passages of
poetry. The handwriting and demotic of one of the writers,
who made erasures, corrections, and revisions, are eccentric.
The previous owners left in haste, but tried to destroy the
most important sections of the manual (mixed perhaps with other
burned documents) by tearing out pages and setting them on
fire. Fortunately, significant fragments of text, though now
deprived of their full context, and distorted by scorching,
curling, and shrinking, survive and if restored can furnish
valuable information. A burnt, hand-painted emblem or insignia
with an illegible inscription from the cover depicts a globe,
pierced by a scimitar, highlighting the Middle East and Africa.
Parts of the manual employ encryption; parts give lists of
chemicals; parts make very specific but obscure cultural, historical,
theological, and geographical references; parts use figurative
language in alluding to a wide range of possible targets. Throughout,
the tone is highly charged, exuding sarcasm, rage, and fanatical
zeal.
A collection of this nature is hardly far-fetched, and in
fact some such collections have already been found. They should all be part
of a distributed digital library. What kind of information
environment, or scholarly, investigative, analytical, communication
infrastructure, is required to access these highly heterogeneous
materials? The highest level of users will be experts in diverse
areas of the sciences and the humanities, including Arabic
language and literature; Arabic script; art history; chemistry;
counterfeit currency and forgeries; criminal investigation
and profiling; dialectology; digital imaging; film studies;
formal and demotic Middle Eastern languages; geography; handwriting
analysis; Islamic history, culture, poetry, and religion; textiles;
library and information science; linguistics; literary theory
and analysis; quantum mechanics; physics; plant biology; psychology;
public records; and textual scholarship.
The heterogeneous group must have a common, shareable infrastructure,
so that users of PCs and Macs will have access to Unix resources,
and must establish a common software architecture, so that
essential tools may be developed and actually used by all members
of the team. To make steady progress in the investigation,
concurrency control must permit all contributors to access
the database or relevant parts of it simultaneously, annotate
it with expert metadata, and update it without overwriting
or otherwise interfering with the on-going annotations of other
expert contributors. Many of the most critical contributors,
for example, the experts in dialectology or poetry, may have
few or no computer skills, and may therefore have an initial
aversion to using computers at all. They will require simple
graphical user interfaces to translate (and transparently encode)
every word in the terrorist manual, and permanently link the
translation with the part of the image on which it is based,
for reference by all other members of the team. Two or more
specialists in Arabic language should be able to gloss the
same parts of the manual at the same time and, by means of
the accruing glossarial metadata, access each other’s
possibly conflicting linguistic annotations in real time, along
with everyone else working on the project. Other hierarchies
of metadata must not conflict with one another. The Arabic
script of the images will run from right to left, while translations
run from left to right. The text must be searchable from text
or image, which are inextricably linked through the annotating
software.
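The requirements in this paragraph (annotations permanently linked to a part of the image, simultaneous glossing without overwriting, and searching from text back to image) can be illustrated with a small sketch. It is only one possible design, offered under stated assumptions: the names Annotation and AnnotationStore, the rectangular pixel regions, the identifier "manual_p12", and the sample glosses are all invented for illustration and are not drawn from any existing system. The sketch makes the store append-only, so that a second expert's gloss is recorded beside the first rather than replacing it.

```python
from dataclasses import dataclass
from threading import Lock
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height) in image pixels


@dataclass(frozen=True)
class Annotation:
    """One expert's note on a rectangular region of a page image (hypothetical model)."""
    image_id: str
    region: Region
    author: str
    kind: str   # e.g. "transcription", "translation", "gloss"
    text: str


class AnnotationStore:
    """Append-only store: adding a note never alters or removes another expert's note."""

    def __init__(self) -> None:
        self._lock = Lock()                 # serializes access from concurrent threads
        self._notes: List[Annotation] = []

    def add(self, note: Annotation) -> None:
        with self._lock:
            self._notes.append(note)

    def for_region(self, image_id: str, region: Region) -> List[Annotation]:
        """All notes on one image region, including conflicting ones, in arrival order."""
        with self._lock:
            return [n for n in self._notes
                    if n.image_id == image_id and n.region == region]

    def search(self, term: str) -> List[Annotation]:
        """Case-insensitive text search; each hit carries its image region."""
        needle = term.casefold()
        with self._lock:
            return [n for n in self._notes if needle in n.text.casefold()]


# Two specialists gloss the same word on the same page (placeholder data).
store = AnnotationStore()
region = (120, 340, 85, 30)
store.add(Annotation("manual_p12", region, "expert_a", "gloss", "the harvest"))
store.add(Annotation("manual_p12", region, "expert_b", "gloss", "the reaping"))

conflicting = store.for_region("manual_p12", region)  # both glosses survive
hits = store.search("HARVEST")                        # text leads back to the image
```

Because nothing is ever overwritten, disagreement between experts is preserved as data for the rest of the team, and every search result points to the exact part of the image on which it is based. A production system would need persistent, distributed storage and finer-grained concurrency control than a single lock.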
Investigators and researchers will want to analyze not only
the material at hand, but also any other pertinent digital
collections relating to it. For instance, they may be able
to reconstruct lost parts of the manual by collating it with
similar manuals found in Karachi, Pakistan; Frankfurt, Germany;
and Manchester, England. Using the preserved letters as models,
they may be able to reconstruct fragmentary letters that were
shredded, burnt, or otherwise damaged, or they may be able
to locate other writing by the same people in other distributed
collections. The images from the video must also be easily
accessible, for some bearded faces from the video appear to
resemble the shaven ones in photographs on the passports and
drivers’ licenses. Easy-to-use tools must be available
for overlaying images and rapidly exploring distributed databases
of faces and any facts already assembled elsewhere about them.
In the digitization program it will be important to adhere
to cataloguing conventions of a metadata standard such as the
Dublin Core (title, creator, subject, description, etc.) for
organizing the materials. It is critical to digitize everything
in the highest possible resolution with a professional digital
camera equipped with software (portable to most commercial cameras)
that inserts technical metadata on the lighting conditions,
the distance from the camera, the exact sizes of the source
artifacts, and all pertinent features of the digital file.
The extremely high resolution, and the use of special lighting
sources (e.g., ultraviolet, infrared, x-ray) will make possible
sophisticated image-processing of the burnt, fragmentary, illegible
passages, the restoration of erased text, minute analyses of
the handwriting, the detection of watermarks in the paper,
and other features requiring microscopic analysis. Some of
the materials, such as the curled pages and the prayer mats,
will also benefit from 3D imaging, algorithms for flattening
the curled pages, reversing shrinking and stretching from fire-damage,
compiling fragmentary letters and using pattern recognition
to identify and restore them, examining the weave and colors
of the prayer mats, and so on.
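As a rough illustration of the cataloguing step, the following sketch pairs the four Dublin Core elements named above with the capture details the camera software would insert. It is a minimal example under stated assumptions: the class names, field names, units, and sample values are hypothetical, and the scale calculation assumes that the image frame spans exactly the width of the artifact.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class DublinCore:
    """The four Dublin Core elements mentioned in the text."""
    title: str
    creator: str
    subject: str
    description: str


@dataclass
class CaptureMetadata:
    """Technical details recorded at capture time (field names are assumptions)."""
    lighting: str               # e.g. "ultraviolet", "infrared"
    camera_distance_mm: float
    artifact_width_mm: float
    artifact_height_mm: float
    pixel_width: int
    pixel_height: int

    def pixels_per_mm(self) -> float:
        # Assumes the frame covers exactly the artifact's width.
        return self.pixel_width / self.artifact_width_mm


def build_record(dc: DublinCore, capture: CaptureMetadata) -> dict:
    """Combine descriptive and technical metadata into one serializable record."""
    return {"dublin_core": asdict(dc), "capture": asdict(capture)}


# Placeholder values for a single page image.
dc = DublinCore(
    title="Manual, page 12",
    creator="unknown",
    subject="hand-written manual",
    description="Scorched fragment photographed under ultraviolet light",
)
capture = CaptureMetadata("ultraviolet", 450.0, 200.0, 300.0, 8000, 12000)
record = build_record(dc, capture)
serialized = json.dumps(record, indent=2)
```

Recording the physical size of the artifact alongside the pixel dimensions lets later image-processing tools convert measurements on the screen back into measurements on the page, which matters for handwriting analysis and for reversing shrinkage.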
Often focusing on theoretical solutions and generic proof-of-concept
issues, most computer scientists have not considered the creation
of practical, cross-disciplinary infrastructures, architectures,
and software tools a legitimate area of research. But in
order to achieve a truly powerful, ubiquitous, interoperable,
easily reconfigurable information environment for all levels
of users across all disciplines, the creation of a common,
collaborative workspace is absolutely essential. Most of the
investigators who have the most to contribute in terms of knowledge
of the data have the least technical expertise to build the
information infrastructure. The computer scientists and their
students, however, cannot build it without the close collaboration
of the heterogeneous domain experts who will use it. The rewards
of this shared information environment would be great for
everyone, and should be demonstrable at an early stage of development.
To make real headway in this emerging area of collaborative
research, it is accordingly essential to overcome the modern
segregation of science and humanities and to forge new and enduring
working environments for research, teaching, and learning across
the disciplines. Some of the most difficult, interesting, and
practical research problems in computer science can be addressed
through serious engagements with broad humanities disciplines
such as architecture, art and archeology, film, history, language
and literature, and music. The structure of modern social systems,
in particular universities, with their cleavages between engineering
and humanities disciplines, does not make it easy to work together.
Funding agencies that recognize the importance of forging new
academic alliances that cross disciplines can do much to encourage
the development of cross-disciplinary and interdisciplinary
undergraduate and graduate curricula, leading to eclectic degrees
in both computer science and humanities disciplines.
Libraries of the future must play a critical role, as well.
Digital resources must be designed from the outset for the
highest level of use, if they are intended to accommodate all
levels of users. For this reason, the grand goal of the new
science of information management must be more ambitious than
the very important but basic one of organizing, storing, and
managing endlessly growing amounts of digital images for universal
remote access. The most useful digital resources will provide
efficient and robust ways of continually incorporating new
knowledge, linking text to related images to enable continued
research, teaching, and learning, and ever-richer searching
of the primary data.
While they may at first seem to be an unduly narrow testbed,
hand-written materials in fact serve as a highly effective
generic model to represent a multilingual, multicultural, multimedia
set of problems.