April 17 - 19, 2007   
Hyatt Regency Phoenix   
Phoenix, Arizona   

 

Position papers

The Relationship between Data and Scholarly Communication

 
   

NSF/JISC Repositories Workshop
Sayeed Choudhury, Johns Hopkins University
April 9, 2007

My comments reflect experience with two projects led by Johns Hopkins University that seemingly come from opposite ends of the spectrum—the Virtual Observatory[1] (VO) and the Roman de la Rose Project.[2]  The VO represents one of the quintessential cyberinfrastructure projects, with large, complex datasets being shared and analyzed by a distributed group of astronomers. 

The Rose Project features the development of a digital environment that will include content and services related to manuscripts written (and illuminated) in medieval French from the late 13th century to the middle 16th century.  At first glance, one might assume that the Rose Project offers little insight regarding data-driven scholarly communication.  Even a completely digitized corpus of Rose manuscripts would not come close to the scale of the VO datasets.  However, when I reflect on the current—and historical—nature of both disciplines, I note an important observation regarding the relationship between data and scholarly practices and communication.

There is a widespread belief that humanists work primarily in isolation, resisting collaborative ventures with other humanists, whereas scientists work as teams, embracing opportunities to work with fellow scientists.  While there is ample evidence to support this belief in the present, when one considers the historical context of the humanities and sciences, the picture becomes more complex.

Rudolphine Tables = Open Content Alliance?

In his position paper, Michael Nelson mentions the Rudolphine Tables, making the interesting observation that they might be considered on par with the Google Book Search[3] or the Open Content Alliance.[4]  Certainly, the publication of these data inspired major advances in astronomy.  Michael also appropriately mentions that these tables might not have been published for a host of reasons, including “significant infrastructure costs (in the form of purpose-built observatories), professional jealousy, intellectual property restrictions, and political and religious instability.”  Even astronomy, not too long ago, was a discipline defined by a lone astronomer who would guard her or his data with great secrecy.  In “data poor” times, it seems that scientists did not readily share data or collaborate.

By the time the Rudolphine Tables had been published, the Roman de la Rose story had been written, re-written, re-purposed, recast, illuminated, and shared many times over.  While it might seem like an unorthodox interpretation, this period represented a “data rich” time for the humanities.  Before the development of scientific instrumentation, “data” consisted of the spoken word, the written word, illuminations, etc.  And, it seems, in this relatively “data rich” environment of the Middle Ages, humanists did collaborate.  Perhaps it is human nature, rather than humanists’ nature, that defines scholarly practice.

Rather than assuming some inherent characteristics of specific disciplines define their modes of scholarship or communication, perhaps it is the relative ease or difficulty with which they can generate, acquire or process data that ultimately influences scholarship.  As an engineering student, I was led to believe that humanities materials are data poor but, in reality, they are data rich in several ways.  A single Rose manuscript contains a tremendous amount of textual, visual and semantic content, which is difficult to extract in meaningful ways.  As our ability to move these data into digital format improves, I believe humanists will naturally collaborate.  Indeed, large-scale digitization might drive the humanities into a new age of data-driven scholarship, just as the Rudolphine Tables inspired astronomers.

A Moment in Time

During the ACRL Conference in Baltimore, I had the pleasure of meeting with the CLIR Postdoctoral Fellows in Scholarly Information Resources.[5]  During this conversation, Chuck Henry, the President of CLIR, mentioned that scholars from the sciences, engineering, social sciences and humanities have each developed their own cyberinfrastructure studies and reports, perhaps representing an unprecedented convergence of interest.  There is no doubt, however, that the sciences and engineering are leading the way for data-driven scholarship in our current environment.

As our digital library group at Johns Hopkins has learned more about the VO, we realize that we are not facing a data deluge: we are facing a data tsunami. Having said this, perhaps the Roman de la Rose was so popular precisely because it felt like an overwhelming new mode of interaction with data.  Let me submit a controversial statement that I believe merits some discussion: putting aside obvious aesthetic differences, scientific datasets are the modern equivalents of medieval manuscripts.

Roles for NSF and JISC

There are, of course, undeniable differences in our environment given the scale of data.  Bill Arms’ position paper quotes Greg Crane: “When collections get large, only the computer reads every word.”  There may be little need to urge scholars to consider the “crisis in scholarly communication” or “barriers” to change: as the amount of data increases, there will probably be a natural shift toward new methods for publication, collaboration, etc. that emphasize machine-readable and machine-actionable approaches.  As scientists such as astronomers lead the way, it will be worthwhile to ascertain whether cyberinfrastructure-related tools, services, and systems from one discipline could support other scientists, engineers, social scientists and even humanists.  NSF and JISC can help track the portability of such resources.

NSF and JISC can undeniably influence the environment through their reward structures.  Funding for projects that support increased data acquisition, integration, processing and analysis should be encouraged.  NSF and JISC are well placed to fund collaborative efforts within the US and UK, respectively, and joint funding programs would have the obvious benefit of bringing together collaborators with similar research or teaching goals but different perspectives.

Finally, NSF and JISC should provide significant funding and support for digital preservation and data curation.  These largely unaddressed areas are essential for scholarly communication.  Eric F. Van de Velde’s suggestion, in his position paper, of Centers of Excellence in Data Preservation is worthwhile.

Imagine the loss to science and scholarship if we had not preserved the Rudolphine Tables or the Roman de la Rose manuscripts.

References

  1. http://www.us-vo.org/
  2. http://rose.mse.jhu.edu/
  3. http://books.google.com/
  4. http://www.opencontentalliance.org/
  5. http://www.clir.org/fellowships/postdoc/postdoc.html