April 17 - 19, 2007   
Hyatt Regency Phoenix   
Phoenix, Arizona   

 

Position papers

Repositories, Cyberinfrastructure and the Humanities

 
   

NSF/JISC Repositories Workshop
Gregory Crane, Tufts University
April 16, 2007

A report, funded by the ACLS and the Mellon Foundation, supports the creation of a cyberinfrastructure for the humanities and social sciences.  Two interrelated questions now arise.  First, what should such a cyberinfrastructure look like?  Second, how will such a cyberinfrastructure affect the role, and the impact, of the humanities within society as a whole?

The second question is important because we in the humanities currently lack the resources to create the infrastructure that we will need.  Even if we build on the infrastructure that the well-funded scientific disciplines create for themselves, bridging the gap between that infrastructure and the needs, both present and emergent, of the humanities will require substantial investment over many years, because we need to reinvent in digital form an intellectual life shaped by print.  If we are to attract the material resources that we need to pursue our most advanced and challenging work, we need to redefine the relationship between academic humanists and society as a whole.  Thus, even if our interests are wholly focused upon specialized research, we need a cyberinfrastructure that extends the intellectual reach of both expert and novice.  In this, professional academics serve their own intellectual interest, for, however sophisticated we may be in our own specialties, we will always be novices in most areas.  If humanists argue that we train our students to think critically, then classicists, for example, should have the critical skills necessary to work with classical Chinese in a developed cyberinfrastructure.

We have good initial data to address the first question.  We cannot predict what form a mature cyberinfrastructure will assume over the coming years, but we know very well some of the basic services that such a cyberinfrastructure must contain.  Our predictions are based on hindsight: support from the IMLS, NEH and NSF has allowed us to build versions of each service and to make some of them available as standard features of a public digital library.  What we propose thus reflects technology that is already available and for which an audience exists.  The services outlined below need to shift from research and development to become part of established infrastructure.  They therefore constitute a minimal set of operations and should be a part of any repository that serves the humanities.

Four basic classes of service emerge:  1) catalogue services identify the discrete objects within a collection (editions of Vergil’s Aeneid, books about Vergil);  2) named entity services identify semantically significant objects embedded within collection objects (references to Vergil or the Aeneid within other documents);  3) customization and personalization (given a particular passage of the Aeneid, what would be of interest to an intermediate student of Latin vs. a professional Latinist?);  4) structured user contributions (e.g., users tell the library that a particular word in a passage of Vergil has a particular sense or plays a grammatical role in the sentence).  Summarization, visualization, machine translation and other technologies all play roles within one or more of the service layers above.

1. Catalogue services:

Generations of librarians have provided a foundation on which to build, but we must go further than traditional catalogues.  The Functional Requirements for Bibliographic Records (FRBR) data model is an important step forward, for it provides an elementary framework within which we can begin to represent some of the basic knowledge structures that experts have developed to describe texts.  A canonical work such as Vergil’s Aeneid has appeared in hundreds, and probably thousands, of versions.  All of them strive to represent a single edition (the text that Vergil left at his death), but errors crept into subsequent copies, and each attempted reconstruction may differ from every other version ever produced.  The Aeneid has been translated into dozens of languages, with each translation based on one or more editions.  The Aeneid has also attracted commentaries: documents that contain annotations about particular words, phrases and sections of the Aeneid.

Figure 1: Information about a canonical chunk of text: Thucydides, book 1, chapter 86.  Note that the display integrates access to content and a catalog of other versions of, or documents relevant to, the same chunk of text.

The FRBR data model allows us to identify and organize all editions, translations, commentaries, indices, and other documents focused on Vergil’s Aeneid.  But we need deeper granularity than FRBR’s manifestations of expressions of a work.  Scholars have established canonical citation schemes so that they can describe the same chunk of text as it appears in many different editions.  Few students and fewer scholars actually want all information about the Aeneid or any other heavily studied canonical work of literature: such works are almost fields unto themselves and no one can read, much less digest, all that has been written about them.  In our day-to-day work, we examine subsets of these texts.  We might adopt a breadth-first approach and examine a topic that runs through the text, e.g., a particular word or image or theme.  Or we might focus in depth on a passage and explore many different themes relevant to it.  In each case, we are looking at defined subsets of these documents.

Scholars have established canonical citation schemes as coordinate systems to map their texts.  Figure 1 resembles a standard text display but it illustrates, instead, the results of a minimal catalogue of a modest collection.  The user has not requested information about Thucydides’ History of the Peloponnesian War as a whole but about Thucydides, History of the Peloponnesian War, book 1, chapter 86.  Notice that the text includes numbered sections as a third level of granularity.  The user could drill down and select one of these sections as the object of interest.  A mature system should be able to catalogue information about every word and every combination of words within and across each canonical chunk of text.

In the humanities, catalogues thus need to include not only books but the canonical documents within books.  We need a catalogue that manages the canonical citation schemes and can extract from an open set of documents, versions of and information relevant to the same logical container.  We also need intelligent version analysis and visualization within our catalogues:  given N editions of a work, how does each edition relate to those which precede it?  Which editions were most influential?  What (if anything) is different in a new edition?
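A minimal sketch in Python may make the idea of cataloguing by canonical citation more concrete.  The class names `Work` and `Version`, the `lookup` method, the version labels and the placeholder strings are all hypothetical; they are not part of FRBR or of any existing system.  Citations are modelled here simply as (book, chapter) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    """One edition, translation, or commentary (hypothetical model)."""
    label: str
    kind: str  # "edition", "translation", or "commentary"
    chunks: dict = field(default_factory=dict)  # citation -> content


@dataclass
class Work:
    """A canonical work together with every version the catalogue knows."""
    title: str
    versions: list = field(default_factory=list)

    def lookup(self, citation):
        """Return (label, kind, content) for each version covering the citation."""
        return [
            (v.label, v.kind, v.chunks[citation])
            for v in self.versions
            if citation in v.chunks
        ]


# Placeholder content only; no real text is reproduced here.
history = Work("Thucydides, History of the Peloponnesian War")
history.versions.append(
    Version("Greek edition A", "edition", {(1, 86): "<Greek text of 1.86>"})
)
history.versions.append(
    Version(
        "English translation B",
        "translation",
        {(1, 86): "<English text of 1.86>", (1, 87): "<English text of 1.87>"},
    )
)
history.versions.append(
    Version("Commentary C", "commentary", {(1, 87): "<notes on 1.87>"})
)

# Ask for book 1, chapter 86 rather than for the work as a whole.
hits = history.lookup((1, 86))
for label, kind, content in hits:
    print(label, kind, content)
```

The point of the design is that the citation, not the book, is the key: the same request returns every version of, or document about, one logical chunk of text.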

Data sources for cataloguing

Cataloguing thousands of citations in hundreds of editions and translations of canonical reference works by hand is not practical.  We must depend upon automatic alignment, cross language information retrieval, and markup projection from one text to the other.  To drive these processes we should have at least one carefully transcribed version of each canonical text in each major citation system.  These base texts can then serve as the anchors around which to discover the many other editions, translations, and commentaries that will surface in very large, emerging collections and then to align these documents to a common citation scheme.
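The anchoring step can be sketched roughly as follows.  This is an illustration under strong assumptions, not a description of any production aligner: it handles only same-language variants (not translations), it uses the character-level similarity ratio from Python's standard `difflib` module in place of a real alignment model, and the function names `similarity` and `project_citations` are hypothetical.  The base text is the first two lines of the Aeneid; the second edition differs only in orthography and punctuation.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two passages, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def project_citations(base_chunks, passages):
    """Give each uncited passage the citation of its most similar base chunk."""
    projected = {}
    for passage in passages:
        citation = max(
            base_chunks, key=lambda c: similarity(base_chunks[c], passage)
        )
        projected[citation] = passage
    return projected


# Carefully transcribed base text, keyed by canonical citation (book.line).
base = {
    "1.1": "Arma virumque cano, Troiae qui primus ab oris",
    "1.2": "Italiam fato profugus Laviniaque venit",
}

# Lines from another edition, arriving without citations and out of order.
other_edition = [
    "Italiam fato profugus Lauiniaque uenit",
    "Arma uirumque cano Troiae qui primus ab oris",
]

projected = project_citations(base, other_edition)
for citation in sorted(projected):
    print(citation, projected[citation])
```

A real system would also need to handle passages that match no base chunk and passages that compete for the same citation; this sketch simply keeps the last match.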

2. Named entity services

We may for the sake of argument assume that catalogues provide access to well-defined objects within a collection.  We also need to be able to locate references to, and then summarize information about, named entities that appear within the contents of our collections.

Named entities can be documents (e.g., references to Thucydides’ History of the Peloponnesian War), citations within documents, people, places, organizations, events and the other topics for which we consult catalogues, encyclopedias and gazetteers.  They include linguistic topics as well:  the word facio is a dictionary heading for the Latin word “to do, make” and is thus a named entity that integrates inflected forms such as fecisset, factus etc.  Every word sense in a dictionary and linguistic phenomenon in a grammar is a separate named entity.  Every subject heading or topic to which we assign a label is a named entity.
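The treatment of a dictionary heading as a named entity can be sketched in a few lines of Python.  The lookup table `FORM_TO_LEMMA` and the function `lemma_counts` are hypothetical and hand-built for illustration; a working service would derive the mapping from a morphological analyzer and a machine readable dictionary rather than from a hard-coded list.

```python
from collections import Counter

# Hand-built table for illustration: inflected form -> dictionary heading.
FORM_TO_LEMMA = {
    "facio": "facio",
    "facit": "facio",
    "fecit": "facio",
    "fecisset": "facio",
    "factus": "facio",
}


def lemma_counts(tokens):
    """Count references to each dictionary heading, whatever the inflection."""
    counts = Counter()
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token.lower())
        if lemma is not None:
            counts[lemma] += 1
    return counts


counts = lemma_counts(["Fecisset", "et", "factus", "est"])
print(counts)
```

Both fecisset and factus are counted as references to the single entity facio, while words missing from the table are ignored.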

Figure 2 -- Named entity identification: places from a chapter of a book about the US Civil War.

Figure 1 also includes a list of place names automatically extracted from the text on the left and linked to places in the world.  The results in this figure illustrate a technical challenge:  three of the four places are incorrectly identified because proper nouns are semantically ambiguous (e.g., Mede, an ethnic name in Thucydides, is also a place name) and place names can describe many different places (there is a Sparta in Canada and an Athens in Alabama).  In practice, place names are relatively easy to find and identify in classical texts (the normal success rate is c. 95%).  Figure 2 illustrates place names extracted from a chapter on the American Civil War as plotted using Google Maps.  In January 2007, Google released its own service to map places from digitized books in Google Book Search.  The goal must be to customize the Google results, making it possible to substitute more accurate services.
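One simple way to reduce such errors is to let the place names in a passage vote on a shared region.  The sketch below is a toy heuristic under stated assumptions, not the method used to produce the figures: the two-entry `GAZETTEER` and the function `disambiguate` are hypothetical, and the candidate regions come only from the examples above.

```python
from collections import Counter

# Toy gazetteer: each name maps to the regions where a place of that name exists.
GAZETTEER = {
    "Athens": ["Greece", "Alabama"],
    "Sparta": ["Greece", "Canada"],
}


def disambiguate(names):
    """Resolve each name to the candidate region best supported by its neighbours."""
    # Every candidate region of every name in the passage receives one vote.
    votes = Counter(
        region for name in names for region in GAZETTEER.get(name, [])
    )
    resolved = {}
    for name in names:
        candidates = GAZETTEER.get(name, [])
        if candidates:
            # Ties fall back to the first candidate listed in the gazetteer.
            resolved[name] = max(candidates, key=lambda region: votes[region])
    return resolved


resolved = disambiguate(["Athens", "Sparta"])
print(resolved)
```

Because Athens and Sparta both have a candidate in Greece, that reading outvotes Alabama and Canada.  A name that appears alone receives no such support, which is one reason collection-specific services can outperform a general one.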

Data sources for named entity identification

Data sources for named entity identification include language models calculated from unstructured articles about particular entities, structured data extracted from print gazetteers, machine readable dictionaries, and other existing knowledge sources, born digital resources such as WordNet, and labeled training sets (which may be lists of passages where named entities are tagged to a high degree of accuracy and which may in turn be mined from print indices).  Reference works from print thus are capital resources in a digital library, providing the foundational data for many of the higher level services on which intellectual life depends.  Automatic clustering and discovery of entities are crucial instruments but unlikely to provide the best results on their own.  Converting print information about the past into machine actionable knowledge is the greatest task that the rising generation of humanists confronts.

3. Customization and Personalization

Once we are able to identify most of the objects and named entities in our collections, we need to use this information to increase intellectual, as well as physical, access.  In print libraries, a book in Greek is useless to a reader who has not studied Greek.  In a modern digital library, machine translation and a host of translation aids should provide basic access to the novice with no Greek and extend the capacity of those studying the language at all levels to draw meaning from the text.

Figure 3 illustrates a simple approach to customization of vocabulary.  The user has developed a profile based on his or her Latin textbook.  The system automatically compares that profile against the words it detects in a given page, then identifies which words the user probably has and has not encountered before.

Figure 3 – Customization.  The digital library recognizes that the user has encountered 54 of 115 dictionary words in a given passage

The example above is fairly simple but the underlying principle is fundamental.  The system asks (1) what it knows about its own contents, (2) what the user already knows, and then (3) customizes the results for the immediate needs of this particular user.
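The three steps can be sketched as a comparison of two sets.  The function name `customize` and the five-word example are hypothetical; the sketch assumes that the words on the page have already been reduced to dictionary headings by a named entity service.

```python
def customize(page_words, known_words):
    """Split the dictionary words on a page into those the user has and has not met."""
    page = set(page_words)                      # (1) what the system knows
    return {
        "known": sorted(page & known_words),    # (2) what the user already knows
        "new": sorted(page - known_words),      # (3) what to highlight for this user
    }


# Hypothetical profile built from the vocabulary of the user's textbook.
profile = {"arma", "vir", "cano"}
page = ["arma", "vir", "cano", "litus", "fatum"]

result = customize(page, profile)
print(len(result["known"]), "of", len(set(page)), "dictionary words already encountered")
print("new words:", result["new"])
```

The same comparison works for any kind of named entity, not only vocabulary: people, places, or grammatical constructions the user has or has not yet encountered.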

Data sources for customization and personalization

We need profiles with structured data representing what topics users have encountered and when.  Named entities are a natural starting point because we already assume services to identify named entities.  We also need log data from which we can identify usage patterns.  We need recommender systems similar to those familiar from Amazon and other e-commerce sites (“users who bought book A also bought books B and C”) but applied to academic issues (e.g., “readers who looked up words X, Y, and Z were also interested in words M, N, and O”).
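A recommender of this kind can be approximated, in its simplest form, by counting which words are looked up together.  The sketch below assumes that log data has already been grouped into sessions, each a set of looked-up words; the function `recommend`, the scoring rule and the sample sessions are hypothetical, and commercial recommenders are considerably more elaborate.

```python
from collections import Counter


def recommend(sessions, looked_up, top_n=3):
    """Suggest words that co-occur in past sessions with the words just looked up."""
    looked_up = set(looked_up)
    scores = Counter()
    for session in sessions:
        overlap = len(looked_up & session)
        if overlap:
            # Sessions sharing more words with this reader count for more.
            for word in session - looked_up:
                scores[word] += overlap
    return [word for word, _ in scores.most_common(top_n)]


# Hypothetical log data: one set of looked-up words per reading session.
sessions = [
    {"X", "Y", "M"},
    {"X", "Z", "M", "N"},
    {"Y", "N"},
    {"Q", "O"},
]

suggestions = recommend(sessions, {"X", "Y", "Z"})
print(suggestions)
```

Here M scores 4 and N scores 3, so both are suggested, while O is not because it never appears alongside X, Y, or Z.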

4. Structured User Contributions

We need not only new methods to acquire traditional publications but also much more granular contributions: e.g., “bank” in passage X represents “river bank” rather than “financial institution”;  Washington in passage A is Washington, DC, but George Washington in passage B.

Figure 4 – Automated systems have enumerated all possible morphological analyses of a given form and then ranked their probability in a given context.  Users can then vote on what they think the correct interpretation is.
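The combination of automated ranking and user votes can be sketched as follows.  The function `choose_analysis`, the probabilities and the votes are hypothetical; the sketch assumes that the system's estimate is used only when no votes exist or when the votes are tied.  The example form is facies, which may be either a noun (“face, appearance”) or the second person singular future of facio.

```python
from collections import Counter

NOUN = "noun: nominative singular of facies"
VERB = "verb: 2nd person singular future of facio"


def choose_analysis(ranked, votes):
    """Pick the analysis with the most user votes; break ties by system probability."""
    tally = Counter(votes)
    best = max(ranked, key=lambda item: (tally[item[0]], item[1]))
    return best[0]


# Analyses enumerated by the system, with illustrative probabilities in context.
ranked = [(NOUN, 0.6), (VERB, 0.4)]

# With no votes, the automated ranking decides.
before = choose_analysis(ranked, [])

# Once users vote, their contributions override the automated ranking.
after = choose_analysis(ranked, [VERB, VERB, NOUN])

print(before)
print(after)
```

Each vote is a small structured contribution tied to one word in one passage, which is the level of granularity described above.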