April 17 - 19, 2007   
Hyatt Regency Phoenix   
Phoenix, Arizona   

 

Position papers

Doing much more than we have so far attempted

 
   

NSF/JISC Repositories Workshop
Donald J. Waters, The Andrew W. Mellon Foundation
April 6, 2007

Not long ago, my colleagues and I at the Mellon Foundation were reviewing a proposal from Stephen Murray, an architectural historian who has been developing a database of images, measurements, virtual reconstructions, and other representations of the features of several hundred churches in medieval France.  We asked Professor Murray about some changes he was proposing to make in the database and wondered how these changes would make the database more attractive and easier for a broad range of students and scholars to use in their studies and research.  He said that we had asked a “wonderful and complex question” and then referred us to a masterwork of 19th-century scholarship that was serving as a model for his own work: Viollet-le-Duc's ten-volume Dictionnaire raisonné de l'architecture française du XIe au XVIe siècle (Dictionary of French Architecture from the 11th to the 16th century).

            The Dictionnaire was published beginning in 1854.  Not unlike the Mellon-funded church project, it exhaustively cataloged and illustrated medieval design and construction methods from basic structural to refined decorative techniques.  Although it was a prodigious effort in itself, building this database was not enough for the author, who feared that without further work to demonstrate its relevance and significance, the Dictionnaire would languish and fall into obscurity.  According to Professor Murray, “the formlessness of a dictionary or database gives the user no particular reason to want to use it.  Realizing this in the 1860s, le-Duc shifted to a different mode of representation—that of story-telling.”  In subsequent writings, le-Duc turned the data compiled in the Dictionnaire into compelling narratives of how medieval French architecture changed and developed, giving rise to the Gothic style.  With such demonstrations of how a rich data source could be mined and used, the Dictionnaire has since been deeply influential in driving the scholarly understanding of French medieval architecture as well as revivalist movements among architects.

             Professor Murray then went on to say in answer to our usability question about his own database that “without the mechanisms of the narrative the user will remain unmotivated.  The problem of the cultural and geopolitical context of architectural production is vast and the user really needs an interlocutor or a teacher to lead them on.  The database cannot fully provide what a good teacher can—but it can do much more than we have so far attempted.”[1]

            The purpose of this workshop on repository support for data-driven science seems to me to be perfectly aligned with Professor Murray’s ambition to “do much more than we have so far attempted” with vast arrays of data.  My task in the following comments is to examine this objective from the perspective of scholarly communications, which includes the processes by which scholars generate, record, report, preserve, and disseminate their knowledge-building activities for each other and the larger society.  What are the key factors, and how might consideration of these factors affect what more could be done in designing systems for the support of data-driven science?  I turn first to the definition of data-driven science.

Data-driven science.  Today, the broadening deployment of computer-based data conversion and capture instruments and sensors has certainly expanded the scale of humanistic and scientific data to be digested.  Scientists are thus confronted with information overload in the form of vast arrays of data that have been generated from vacuum-cleaner surveys of, among other things, the galaxies, physical phenomena on earth, and the molecular composition of organic life.  Because these data arrays generally  do not lend themselves easily to controlled experiments or the application of theory, some scientists have suggested that a new paradigm called data-driven science is emerging—or needs to emerge—for the discovery of new knowledge.  Instead of depending on theory and experiment, such a paradigm would depend heavily on the data-mining, pattern matching and simulation capabilities of high-performance computing.[2]   

            One might reasonably be skeptical of such a sweeping claim of novelty, just as one might observe that nearly every generation complains that it is uniquely burdened by information overload even as it develops and adapts tools and methods to bring the problem under control.  However, even if these experiences of and approaches to data are in fact new to scientists, they are all too familiar to any humanist who has tried to fathom the depths of a historical archive, comprehend the semantics of an unknown language, describe the social rituals of a foreign culture, reconstruct social practice from archaeological remains, or, with Professor Murray and Viollet-le-Duc before him, interpret architectural production within what Murray described as its vast cultural and geopolitical context.  Deep dives into formless data, pattern-matching to organize and render those data more manageable, and story-telling, which after all is at the core of all simulation, are stocks in trade of humanistic scholarship.

            In other words, as the anecdote from architectural history makes clear, data-driven scholarship is not such a new paradigm and may offer a deep methodological basis for collaboration between the scientists and humanists.  What is new, however, and what represents at least in part the challenge and opportunity of “doing more than we have so far attempted” is the formalization of very traditional interpretive activities of data-mining, pattern matching, and story-telling or simulation in powerful algorithms that represent large and complex sets of data in terms of multiple features and variables that can be analyzed, tested, replicated, and changed at the scale and speed afforded by advanced computation.  The promise of these new automated capabilities is that new knowledge can be created in ways that were not previously possible.  To achieve this promise a dependable, deeply scaled, and flexible repository infrastructure is needed for managing the data, serving them up for various analytical and synthetic tasks, and capturing the outputs of such scholarly work.  This infrastructure depends in part on the larger context of scholarly communications, and here I limit my focus to the implications of three aspects of that larger domain: the general qualities of the analytic process itself, its grounding in discipline-based culture, and its economic and legal underpinnings.

The analytic process.  One of the fundamental building blocks of scholarly activity is search—for both basic discovery and more complex analysis and synthesis.  Over the last fifteen years, search has been a growth industry, and search for discovery has become almost a commodity item, having been the subject of intense investment and development by Google, Amazon, Yahoo, Microsoft, and their predecessors and competitors.  Search is effective as a discovery tool, however, only insofar as a sufficiently rich body of sources is comprehensively aggregated to be worth searching.  Successful aggregation of sources at scale is the unsung hero of the success of the search engine industry, and if we are looking for models of advanced repositories, it is to the Googles and Amazons that we must surely look, not just for how they have stored these aggregations, but also for how they have been gathered in disparate formats from multiple sources operating under a variety of business models and intellectual property regimes, and then normalized and indexed for rapid delivery.

            But search for discovery is only the beginning of the scholarly process.  Scholars then must zero in on the subsets they have found: the primary and secondary source objects of interest to their work.  They need to pull together these selected subsets for deeper analysis.  The process of aggregation at this stage is more difficult and complicated because data need to be scrubbed, normalized, and prepared in a more rigorous fashion than is likely to be necessary or affordable to the commodity search engines.  The provenance and authenticity of the information need to be established, rights cleared, and databases and database schemas created; textual objects may need to be translated and marked up for grammatical and structural features as well as semantically according to certain knowledge structures; numeric data may need conversion to common measures; and assumptions and guesswork throughout need to be carefully documented.  Over centuries of data-driven work in the humanities, such processes were codified and standardized in the hands of what became commonly known as documentary editors.  Today, given the amount of data, the more these processes can be automated, the better, and these functions are increasingly regarded as “curatorial” rather than “editorial” in nature.  The main difference between the two designations seems to be that a documentary editor engages in active and largely manual tasks, while the data curator tends to take a more hands-off role and instead presides over an increasingly automated set of transformations.
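            As a minimal sketch of two of the preparation steps just named, converting numeric data to a common measure and documenting assumptions along the way, consider the Python fragment below.  The names Measurement and normalize, the short unit table, and the sample source label are hypothetical and chosen only for illustration; they do not describe any existing curation tool.

```python
from dataclasses import dataclass, field

# Conversion factors to metres. Illustrative only, not an exhaustive unit table.
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


@dataclass
class Measurement:
    """A normalized reading plus the record needed to audit it (hypothetical type)."""
    value: float          # value expressed in metres
    unit: str             # always "m" after normalization
    source: str           # provenance: where the raw figure came from
    notes: list = field(default_factory=list)  # every assumption made en route


def normalize(raw_value, raw_unit, source, default_unit="m"):
    """Convert one raw reading to metres and log each assumption that was made."""
    notes = []
    unit = (raw_unit or "").strip().lower()
    if not unit:
        # Guesswork is allowed, but it must be written down.
        unit = default_unit
        notes.append(f"unit missing; assumed '{default_unit}'")
    if unit not in TO_METRES:
        raise ValueError(f"unknown unit {raw_unit!r} in {source}")
    metres = float(raw_value) * TO_METRES[unit]
    if unit != "m":
        notes.append(f"converted from {raw_value} {unit}")
    return Measurement(round(metres, 4), "m", source, notes)


if __name__ == "__main__":
    reading = normalize("120", "ft", "survey-1890")
    print(reading.value, reading.unit, reading.notes)
```

            The point of the sketch is the notes field rather than the arithmetic: an automated transformation that records what it assumed leaves the kind of trail a documentary editor would have left by hand.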

            More research is desperately needed, however, to improve the accuracy and reliability of automated data preparation or “curation” processes.  In addition, specifically with respect to repositories, there is a two-fold challenge in this area of data curation.  First, the rapid and seamless preparation of data requires standard protocols and interfaces for moving either the information or a processing engine in and out of repositories.  Second, much of the preparation or curation work creates intermediate representations of data.  As large-scale analyses in a variety of projects have shown, including the MONK and Nora projects[3] for textual analysis at the University of Illinois, these intermediate products may themselves be worth saving for purposes of experimental iteration and replication, and research is much needed to identify when and under what conditions they are indeed worth saving.
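            One way to picture what saving an intermediate product for replication might involve is sketched below in Python.  It is an assumption-laden illustration, not a description of how MONK, Nora, or any repository works: the IntermediateStore class, its in-memory dictionary, and the tokenize step are all hypothetical.  Each intermediate result is filed under a key derived from the step name, its parameters, and its input, so that repeating the same step retrieves the saved product instead of recomputing it.

```python
import hashlib
import json


class IntermediateStore:
    """In-memory stand-in for a repository that keeps intermediate products."""

    def __init__(self):
        self._items = {}

    @staticmethod
    def key(step, params, input_data):
        # Identical step, parameters, and input always produce the same key.
        payload = json.dumps(
            {"step": step, "params": params, "input": input_data},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, step, func, params, input_data):
        """Return (key, product), computing the product only if it is not stored."""
        k = self.key(step, params, input_data)
        if k not in self._items:
            self._items[k] = func(input_data, **params)
        return k, self._items[k]


def tokenize(text, lowercase=True):
    """A toy preparation step: split text on whitespace."""
    words = text.split()
    return [w.lower() for w in words] if lowercase else words


if __name__ == "__main__":
    store = IntermediateStore()
    key, tokens = store.run("tokenize", tokenize, {"lowercase": True}, "The Gothic style")
    print(key[:12], tokens)
```

            Whether a given product deserves a place in such a store is exactly the open research question raised above; the sketch shows only that the bookkeeping itself can be simple.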

Discipline-based culture.  Although one can identify important challenges associated with the general features of the scholarly process as it moves from discovery to data preparation and analysis, it is also necessary to recognize, as numerous studies have shown, that significant differences in scholarly practice exist among disciplines and fields of study.  It is within the disciplines where the pressures to innovate and advance knowledge are greatest, but the investments in automation are highly uneven, with some fields bursting with energy and creativity and others operating within relatively static paradigms.  At Berkeley’s Center for Studies in Higher Education, former University of California Provost, Jud King, and his colleague, Diane Harley, recently concluded one of these studies focused on the promotion and tenure decision-making process.[4]

            King and Harley show that recognition of innovative computationally-based forms of scholarship and publication occurs slowly in general, but that variation is greatest at the discipline level.  Moreover, and perhaps more importantly, they make the useful distinction between formal and informal modes of communication and observe that the formal modes, such as publication in peer-reviewed books and journals, tend to be most deeply resistant to change.  After all, the formal means of establishing scholarly credentials are the basis on which institutional position, rank, and salary are allocated, and few scholars are prepared to take significant action that would disrupt their means of livelihood.  King and Harley go on to observe, however, that the informal realm is where scholars work with each other on a daily basis, consulting with one another and letting each other know which techniques worked, which did not, and what new discoveries they have made.  In this informal realm, at the edge of the reputational and promotional system where credentials are being formed rather than fixed, innovation is easier and more likely to occur.

            The important distinction between formal and informal modes of scholarly communication helps explain why the physics arXiv, to which all high-energy physicists routinely deposit their papers, continues to exist alongside, rather than replacing, as some have promised for almost two decades that it would, a publication system to which they also routinely submit their papers: one is an informal mode of communication and the other is formal.  The innovative automation of the preprint process in the arXiv in the early 1990s was built on a stunning ethnographic insight about the informal scholarly communications process in physics, and it has been usefully extended to other fields in the sciences and the social sciences where there have been informal traditions of circulating preprints and working papers.  Little innovation has occurred in this area since the initial breakthrough and, as Paul Ginsparg reported at Rice University’s De Lange Conference in March 2007,[5] even the code base for the arXiv system has changed little since the mid-1990s.  Real innovation in scholarly communications is now occurring elsewhere in the formal and informal systems of communications, and continued attention to the potential interaction between preprints and formal publication threatens to divert resources from other areas where they might be needed and better invested.

            From the perspective of repositories and data-driven scholarship, perhaps the most important opportunity is to focus on the construction and curation of data sets.  It may be sufficient for funding agencies and journal and book publishers to mandate that original data sets on which new publications are based be deposited and maintained in publicly accessible repositories.  However, some fields are thinking even more innovatively and are trying to build peer-review systems around the data so that the data sets can be judged formally on qualities of coherence, design, consistency, reliability of access, and so on.  With JISC support in the UK, scientists and professional associations in the field of meteorology have joined to establish a new kind of electronic publication called a data journal, where practitioners would submit data sets for peer review and dissemination.[6]  With Mellon support, in the field of nineteenth-century literary studies, Jerry McGann at the University of Virginia has organized scholarly societies into a federation for the purpose of providing peer review for data in the form of online documentary editions of nineteenth-century authors.[7]  And Bernard Frischer, who is a specialist in online virtual reconstructions of archaeological sites, has received NSF support to plan a journal-like outlet that would provide peer review of virtual reconstructions.[8]  More research and experimentation with forms of peer-reviewed data could have significant impact in helping organize the field of data curation, provide additional information for promotion and tenure committees, and avoid wasting resources in a frontal assault on a long-established and, by many accounts, still highly valued system of formal publication.

The economy of openness and intellectual property.  Vigorous innovation is also occurring on the edge of the traditional and formal system of scholarly publication, and it will deeply affect the sustainability of new forms of data-driven scholarship.  For such scholarship to thrive, there is little question that data and other forms of content need to be open, in the sense that economic, intellectual property, and other barriers must be low enough to permit an easier flow of information into rich computational environments.  In many fields, these barriers are falling with innovations such as embargo periods, moving walls, toll-free access, and special forms of license.  Even the major publishers, like Elsevier, Wiley, and others, are actively participating in these activities, opening their published content.  However, these publishers are not innovating in the direction of greater openness for its own sake, but to advance innovative new business opportunities built precisely around new forms of data-mining and other services that depend on open content.  The principle of openness thus is crucial in the formation of public policy for scholarship and may be necessary for new forms of sustainable businesses to emerge that support scholarship, but simple advocacy of openness for its own sake is not necessarily sufficient or wise.  Here are a few examples of how focused research in this area of rapid innovation could deepen and sharpen our thinking about economic and intellectual property policies.

            First, let me draw your attention to complex intellectual property issues associated with the arrangements between libraries and commercial entities such as Google, Microsoft, ProQuest and others.  Peter Kaufman and Ithaka made a useful attempt in 2004 to analyze the large variety of types of arrangements, some of which involve secret deals and do not always articulate coherent and collective educational and public interest objectives.[9]  Additional work is especially needed on the IP issues associated with emerging commercial services that will likely make use of open access materials.

            Indeed, sophisticated publishers are increasingly seeing that the availability of material in open access form gives them important new business opportunities.  That is, they can begin to incorporate and recombine materials that they and other publishers have produced with data and other related materials in sophisticated databases, subject them to sophisticated search, data mining, and semantic algorithms, and then present these as services to a variety of specialized audiences willing to pay for the added value over and above the original content.  These may be desirable outcomes in the end, and certainly present opportunities for useful partnerships among scholars, libraries, and publishers.  However, what is worrisome about many arguments in favor of open access is the lack of strategic thinking about how open access material will actually be used once it is made available, and the faith-based assumptions that only beneficial consequences will follow from providing open access. 

            One worry is that open access to traditionally published monographs and serials will cannibalize the sales of smaller publishers, pushing them into further decline, and make it difficult for them to invest in ways to help scholars select, edit, market, evaluate, and sustain the new products of scholarship represented in digital resources and databases.  The bigger worry, which is hardly recognized and much less discussed in open access circles, is that the large, heavily capitalized publishing firms will exploit open access repositories, cherry-picking the most valuable open access products, combining them with the most valuable new databases and resources, and selling them back to the academy at a significant profit, while chasing out sources of capital from within the academic community that are desperately needed to advance scientific, humanistic, and social science study. 

            Policy-oriented research is needed that moves past simple advocacy of open access and the trendy, glitzy rhetoric about the initial step of making materials freely available, and that focuses strategically on the full life cycle of scholarly communications.  Hard questions have to be asked: open access for what and for whom, and how can we ensure that there is sufficient capital for investment in the dissemination of new and emerging forms of scholarly output?  In the software arena, a variety of alternatives have been explored and articulated in the form of open source licenses, some of which facilitate desirable downstream activities and others of which do not.  For content, options like those afforded by the Creative Commons licenses are important to consider, but now, with wider use, conflicts are beginning to emerge from the different forms of license, and especially around the different interpretations of commercial and non-commercial use.  There are no magic bullets in a highly pluralistic world, and these conflicts and their potential solutions need to be much better understood.

References

  1. E-mail correspondence from Stephen Murray to Suzanne Lodato, May 4, 2007, quoted by permission.
  2. See for example the testimony of Raymond L. Orbach, Director, Office of Science, U.S. Department of Energy, before the U.S. House of Representatives Committee on Science, July 16, 2003.  Available at http://www.science.doe.gov/Sub/speeches/Congressional_Testim/7_16_03_testimony.htm.
  3. See Nora project description as well as reports and publications at http://www.noraproject.org.  Information on the MONK project is forthcoming at http://www.monkproject.org.
  4. Diane Harley, Sarah Earl-Novell, Jennifer Arter, Shannon Lawrence, and C. Judson King, “The Influence of Academic Values on Scholarly Publication and Communication Practices,” Research and Occasional Paper Series, CSHE.13.06 (September 2006), Berkeley, CA: Center for Studies in Higher Education.  Available at http://cshe.berkeley.edu/publications/publications.php?id=232.
  5. Paul Ginsparg, “Read as We May,” presentation at the De Lange Conference on Emerging Libraries, Rice University, March 6, 2007.  Webcast available at http://webcast.rice.edu/webcast.php?action=details&event=985.
  6. Dr Alan Gadian, Principal Investigator, The Overlay Journal Infrastructure for Meteorological Sciences (OJIMS) Project.  Available at http://www.see.leeds.ac.uk/research/ias/dynamics/current/ojims.html.
  7. Bethany Nowviskie and Jerome McGann, NINES: A Federated Model for Integrating Digital Scholarship, September 2005.  Available at http://www.nines.org/about/9swhitepaper.pdf.
  8. The SAVE (Serving and Archiving Virtual Environments) project.  See http://www.iath.virginia.edu/save/.
  9. Peter B. Kaufman, “Marketing Culture in the Digital Age: A Report on New Business Collaborations Between Libraries, Museums, Archives and Commercial Companies,” August 25, 2005.  Available at http://www.intelligenttelevision.com/MarketingCultureinDigitalAge.pdf.